Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way. Operations that used to take hours or days take seconds or minutes instead. Create Cloud Dataproc clusters quickly and resize them at any time, so you don't have to worry about your data pipelines outgrowing your clusters.

This lab shows you how to use the gcloud command-line tool in Google Cloud to create a Cloud Dataproc cluster, run a simple Apache Spark job on the cluster, and then modify the number of workers in the cluster.

Create a cluster

In Cloud Shell, run the following command to set the Dataproc region:

gcloud config set dataproc/region global
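
(Optional) If you want to confirm that the setting took effect, gcloud's standard config command prints the active value; it should echo the region you just set:

gcloud config get-value dataproc/region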

Run the following command to create a cluster called example-cluster with default Cloud Dataproc settings:

gcloud dataproc clusters create example-cluster

If asked to confirm a zone for your cluster, enter Y.

Your cluster will take a couple of minutes to build.

Waiting for cluster creation operation...done.
Created [... example-cluster]

When you see a "Created" message, you're ready to move on.
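
(Optional) If you want to double-check the cluster before continuing, the standard listing command should show example-cluster with a RUNNING status:

gcloud dataproc clusters list

You can also inspect the cluster's full configuration with gcloud dataproc clusters describe example-cluster.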

Submit a job

Run this command to submit a sample Spark job that calculates a rough value for pi:

gcloud dataproc jobs submit spark --cluster example-cluster \
  --class org.apache.spark.examples.SparkPi \
  --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

The command specifies:

- The spark job type
- The cluster to run the job on (example-cluster)
- The class containing the main method for the job's pi-calculating application
- The location of the jar file containing the job's code
- The parameter to pass to the job, in this case the number of tasks (1000)

Parameters passed to the job must follow a double dash (--). See the gcloud documentation for more information.
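
(Optional) After the job finishes, you can review it from the command line as well; the SparkPi job should appear in the job list for the cluster with a DONE status:

gcloud dataproc jobs list --cluster example-cluster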