Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way. Operations that used to take hours or days take seconds or minutes instead. Create Cloud Dataproc clusters quickly and resize them at any time, so you don't have to worry about your data pipelines outgrowing your clusters.
This lab shows you how to use the gcloud command-line tool on Google Cloud to create a Cloud Dataproc cluster, run a simple Apache Spark job on the cluster, and then modify the number of workers in the cluster.
In Cloud Shell, run the following command to set the Region:
gcloud config set dataproc/region global
Run the following command to create a cluster called example-cluster with default Cloud Dataproc settings:
gcloud dataproc clusters create example-cluster
If asked to confirm a zone for your cluster, enter Y.
The cluster takes a couple of minutes to build.
Waiting for cluster creation operation...done.
Created [... example-cluster]
When you see a "Created" message, you're ready to move on.
Run this command to submit a sample Spark job that calculates a rough value for pi:
gcloud dataproc jobs submit spark --cluster example-cluster \
  --class org.apache.spark.examples.SparkPi \
  --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
The command specifies:
- example-cluster as the cluster to run the job on
- --class as the class containing the main method for the job's pi-calculating application
- 1000 as the parameter passed to the job itself
Parameters passed to the job must follow a double dash (--). See the gcloud documentation for more information.
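The SparkPi example estimates pi with a Monte Carlo method: it samples random points in the unit square and counts how many land inside the quarter circle of radius 1. The sketch below (plain Python, with a hypothetical `estimate_pi` helper; not the SparkPi source) shows the idea that the Spark job distributes across the cluster's workers, where the `1000` argument controls how many sampling tasks Spark runs.

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi by sampling points in the unit square.

    The fraction of points falling inside the quarter circle
    (x^2 + y^2 <= 1) approximates pi/4.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))  # roughly 3.14
```

In the Spark version, each of the 1000 tasks performs this sampling independently and the results are aggregated, which is why adding workers to the cluster speeds the job up.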