In this lab, you will learn how to create a streaming pipeline using one of Google's Cloud Dataflow templates. More specifically, you will use the Cloud Pub/Sub to BigQuery template, which reads messages written in JSON from a Pub/Sub topic and pushes them to a BigQuery table. The template is documented in the Google-provided Dataflow templates reference.
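To picture the kind of message the template consumes, here is a rough sketch of publishing one JSON message to a Pub/Sub topic with gcloud. The topic name and field values are made-up examples for illustration only; the field names mirror the table schema you will define later in this lab.

gcloud pubsub topics publish my-test-topic \
  --message='{"ride_id":"example-ride","point_idx":1,"latitude":40.7419,"longitude":-73.9893,"timestamp":"2025-01-01T12:00:00Z","meter_reading":5.30,"meter_increment":0.10,"ride_status":"enroute","passenger_count":2}'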

You'll be given the option to use the Cloud Shell command line or the Cloud Console to create the BigQuery dataset and table. Pick one method to use, then continue with that method for the rest of the lab. If you want experience using both methods, run through this lab a second time.

Create a Cloud BigQuery Dataset and Table Using Cloud Shell

Let's first create a BigQuery dataset and table.

Note: This section uses the bq command-line tool. Skip down if you want to run through this lab using the console.

Run the following command to create a dataset called taxirides:

bq mk taxirides

Your output should look similar to:

Dataset '<myprojectid:taxirides>' successfully created
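If you want to double-check that the dataset exists, you can optionally list the datasets in your current project (an extra verification step, not required by the lab):

bq ls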

Now that you have your dataset created, you'll use it in the following step to instantiate a BigQuery table. Run the following command to do so:

bq mk \
--time_partitioning_field timestamp \
--schema ride_id:string,point_idx:integer,latitude:float,longitude:float,\
timestamp:timestamp,meter_reading:float,meter_increment:float,ride_status:string,\
passenger_count:integer -t taxirides.realtime

Your output should look similar to:

Table 'myprojectid:taxirides.realtime' successfully created

On its face, the bq mk command looks a bit complicated. However, with some assistance from the BigQuery command-line documentation, we can break down what's going on here. For example, the documentation tells us a little bit more about the --schema flag: it accepts either the path to a local JSON schema file or a comma-separated list of column definitions in the form field:data_type,field:data_type.

In this case, we are using the latter: a comma-separated list of column definitions.
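If you'd like to see how that comma-separated list was translated into a table schema, you can optionally inspect the table you just created (again, an extra check rather than a required lab step):

bq show --schema --format=prettyjson taxirides.realtime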

Create a storage bucket

Now that we have our table instantiated, let's create a bucket. Run the following commands to do so:

export BUCKET_NAME=<your-unique-name>
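Keep in mind that bucket names must be globally unique across Cloud Storage, so a name based on your project ID is a safe choice. A minimal sketch of picking such a name and creating the bucket, assuming you are working in Cloud Shell (which sets the GOOGLE_CLOUD_PROJECT variable) and that gsutil is used for the bucket creation:

export BUCKET_NAME=${GOOGLE_CLOUD_PROJECT}-taxirides
gsutil mb gs://$BUCKET_NAME/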