This tutorial illustrates different ways to create and submit a Spark Scala job to a Dataproc cluster, including how to:
- write and compile a Spark Scala "Hello World" app on a local machine from the command line using the Scala REPL (Read-Evaluate-Print-Loop or interactive interpreter) or the SBT build tool
- package compiled Scala classes into a jar file with a manifest
- submit the Scala jar to a Spark job that runs on your Dataproc cluster
- examine Scala job output from the Google Cloud console
This tutorial also shows you how to:
- write and run a Spark Scala "WordCount" mapreduce job directly on a Dataproc cluster using the `spark-shell` REPL
- run pre-installed Apache Spark and Hadoop examples on a cluster
Set up a Google Cloud Platform project
If you haven't already done so, create a Google Cloud Platform project and a Cloud Storage bucket, and create the Dataproc cluster you will use to run the jobs in this tutorial.
Write and compile Scala code locally
As a simple exercise for this tutorial, write a "Hello World" Scala app using the Scala REPL or the SBT command line interface locally on your development machine.
Use Scala
- Download the Scala binaries from the Scala Install page.
- Unpack the file, set the `SCALA_HOME` environment variable, and add it to your path, as shown in the Scala Install instructions. For example:

  ```
  export SCALA_HOME=/usr/local/share/scala
  export PATH=$PATH:$SCALA_HOME/bin
  ```

- Launch the Scala REPL:

  ```
  $ scala
  Welcome to Scala version ...
  Type in expressions to have them evaluated.
  Type :help for more information.
  scala>
  ```

- Copy and paste the `HelloWorld` code into the Scala REPL:

  ```scala
  object HelloWorld {
    def main(args: Array[String]): Unit = {
      println("Hello, world!")
    }
  }
  ```

- Save `HelloWorld.scala` and exit the REPL:

  ```
  scala> :save HelloWorld.scala
  scala> :q
  ```

- Compile with `scalac`:

  ```
  $ scalac HelloWorld.scala
  ```

- List the compiled `.class` files:

  ```
  $ ls HelloWorld*.class
  HelloWorld$.class   HelloWorld.class
  ```
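Before packaging, you can sanity-check the compiled classes by running them with the `scala` runner from the same directory (this assumes `scala` is still on your `PATH`):

```
$ scala HelloWorld
Hello, world!
```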
Use SBT
Create a "HelloWorld" project, as shown below
$ mkdir hello $ cd hello $ echo \ 'object HelloWorld {def main(args: Array[String]) = println("Hello, world!")}' > \ HelloWorld.scala
Create an
sbt.build
config file to set theartifactName
(the name of the jar file that you will generate, below) to "HelloWorld.jar" (see Modifying default artifacts)echo \ 'artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) => "HelloWorld.jar" }' > \ build.sbt
Launch SBT and run code
$ sbt [info] Set current project to hello ... > run ... Compiling 1 Scala source to .../hello/target/scala-.../classes... ... Running HelloWorld Hello, world! [success] Total time: 3 s ...
Package code into a jar file with a manifest that specifies the main class entry point (
HelloWorld
), then exit> package ... Packaging .../hello/target/scala-.../HelloWorld.jar ... ... Done packaging. [success] Total time: ... > exit
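The Hello World app above has no dependencies, so the one-line `build.sbt` suffices. A real Spark app would typically also pin the Scala version and declare Spark as a `provided` dependency, since the cluster supplies Spark at runtime. A minimal sketch (the version numbers here are illustrative assumptions; match them to your cluster's image):

```scala
// build.sbt -- illustrative sketch; align versions with your Dataproc image.
scalaVersion := "2.12.18"  // assumed version, not from this tutorial

// "provided" keeps Spark out of the packaged jar; the cluster supplies it.
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.0" % "provided"

artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
  "HelloWorld.jar"
}
```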
Create a jar
Create a jar file with SBT or with the `jar` command.
Create a jar with SBT
The SBT `package` command creates a jar file (see Use SBT).
Create a jar manually
- Change directory (`cd`) into the directory that contains your compiled `HelloWorld*.class` files, then run the following command to package the class files into a jar with a manifest that specifies the main class entry point (`HelloWorld`):

  ```
  $ jar cvfe HelloWorld.jar HelloWorld HelloWorld*.class
  added manifest
  adding: HelloWorld$.class(in = 637) (out= 403)(deflated 36%)
  adding: HelloWorld.class(in = 586) (out= 482)(deflated 17%)
  ```
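Before uploading, you can confirm that the manifest records the entry point (`unzip -p` prints a zip entry to stdout; any zip reader works):

```
$ unzip -p HelloWorld.jar META-INF/MANIFEST.MF
Manifest-Version: 1.0
Created-By: ...
Main-Class: HelloWorld
```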
Copy jar to Cloud Storage
- Use the Google Cloud CLI to copy the jar to a Cloud Storage bucket in your project:

  ```
  $ gcloud storage cp HelloWorld.jar gs://<bucket-name>/
  Copying file://HelloWorld.jar [Content-Type=application/java-archive]...
  Uploading gs://bucket-name/HelloWorld.jar: 1.46 KiB/1.46 KiB
  ```
Submit jar to a Dataproc Spark job
Use the Google Cloud console to submit the jar file to your Dataproc Spark job. Fill in the fields on the Submit a job page as follows:
- Cluster: Select your cluster's name from the cluster list.
- Job type: Spark.
- Main class or jar: Specify the Cloud Storage URI path to your HelloWorld jar (`gs://your-bucket-name/HelloWorld.jar`). If your jar does not include a manifest that specifies the entry point to your code ("Main-Class: HelloWorld"), the "Main class or jar" field should state the name of your Main class ("HelloWorld"), and you should fill in the "Jar files" field with the URI path to your jar file (`gs://your-bucket-name/HelloWorld.jar`).
Click Submit to start the job. Once the job starts, it is added to the Jobs list.
Click the Job ID to open the Jobs page, where you can view the job's driver output.
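You can also submit the jar from the command line instead of the console; a sketch using the gcloud CLI, with placeholder cluster and region names:

```
gcloud dataproc jobs submit spark --cluster=cluster-name \
    --region=region \
    --jar=gs://your-bucket-name/HelloWorld.jar
```

Since the jar's manifest names the main class, `--jar` alone is sufficient; without a manifest, pass `--class=HelloWorld --jars=gs://your-bucket-name/HelloWorld.jar` instead.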
Write and run Spark Scala code using the cluster's `spark-shell` REPL
You may want to develop Scala apps directly on your Dataproc cluster. Hadoop and Spark are pre-installed on Dataproc clusters, and they are configured with the Cloud Storage connector, which allows your code to read and write data directly from and to Cloud Storage.
This example shows you how to SSH into your project's Dataproc cluster master node, then use the spark-shell REPL to create and run a Scala wordcount mapreduce application.
SSH into the Dataproc cluster's master node
Go to your project's Dataproc Clusters page in the Google Cloud console, then click on the name of your cluster.
On the cluster detail page, select the VM Instances tab, then click the SSH selection that appears at the right of your cluster's master node row.
A browser window opens at your home directory on the master node.
Launch the `spark-shell`:

```
$ spark-shell
...
Using Scala version ...
Type in expressions to have them evaluated.
Type :help for more information.
...
Spark context available as sc.
...
SQL context available as sqlContext.
scala>
```
Create an RDD (Resilient Distributed Dataset) from a Shakespeare text snippet located in public Cloud Storage:

```
scala> val text_file = sc.textFile("gs://pub/shakespeare/rose.txt")
```
Run a wordcount mapreduce on the text, then display the `wordCounts` result:

```
scala> val wordCounts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
scala> wordCounts.collect
... Array((call,1), (What's,1), (sweet.,1), (we,1), (as,1), (name?,1), (any,1), (other,1), (rose,1), (smell,1), (name,1), (a,2), (would,1), (in,1), (which,1), (That,1), (By,1))
```
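To see the most frequent words first, you can sort the pair RDD before collecting, using the standard RDD `sortBy` method:

```
scala> // Sort by count, descending, then show the ten most frequent pairs.
scala> wordCounts.sortBy(_._2, ascending = false).take(10)
```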
Save the counts in `<bucket-name>/wordcounts-out` in Cloud Storage, then exit the `spark-shell`:

```
scala> wordCounts.saveAsTextFile("gs://<bucket-name>/wordcounts-out/")
scala> exit
```
Use the gcloud CLI to list the output files and display the file contents:

```
$ gcloud storage ls gs://bucket-name/wordcounts-out/
gs://spark-scala-demo-bucket/wordcounts-out/
gs://spark-scala-demo-bucket/wordcounts-out/_SUCCESS
gs://spark-scala-demo-bucket/wordcounts-out/part-00000
gs://spark-scala-demo-bucket/wordcounts-out/part-00001
```
Check the `gs://<bucket-name>/wordcounts-out/part-00000` contents:

```
$ gcloud storage cat gs://bucket-name/wordcounts-out/part-00000
(call,1)
(What's,1)
(sweet.,1)
(we,1)
(as,1)
(name?,1)
(any,1)
(other,1)
```
Running pre-installed example code
The Dataproc master node contains runnable jar files with standard Apache Hadoop and Spark examples.
| Jar Type | Master node /usr/lib/ location | GitHub Source | Apache Docs |
|---|---|---|---|
| Hadoop | `hadoop-mapreduce/hadoop-mapreduce-examples.jar` | source link | MapReduce Tutorial |
| Spark | `spark/lib/spark-examples.jar` | source link | Spark Examples |
Submitting examples to your cluster from the command line
Examples can be submitted from your local development machine using the Google Cloud CLI `gcloud` command-line tool (see Using the Google Cloud console to submit jobs from the Google Cloud console instead).
Hadoop WordCount example
```
gcloud dataproc jobs submit hadoop --cluster=cluster-name \
    --region=region \
    --jars=file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    --class=org.apache.hadoop.examples.WordCount \
    -- URI of input file URI of output file
```
Spark WordCount example
```
gcloud dataproc jobs submit spark --cluster=cluster-name \
    --region=region \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.JavaWordCount \
    -- URI of input file
```
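Any other class in the examples jar can be submitted the same way; for instance, the classic SparkPi example, which takes a partition count as its argument (the class name comes from the Spark examples source):

```
gcloud dataproc jobs submit spark --cluster=cluster-name \
    --region=region \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    -- 1000
```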
Shut down your cluster
To avoid ongoing charges, shut down your cluster and delete the Cloud Storage resources (Cloud Storage bucket and files) used for this tutorial.
To shut down a cluster:

```
gcloud dataproc clusters delete cluster-name \
    --region=region
```

To delete the Cloud Storage jar file:

```
gcloud storage rm gs://bucket-name/HelloWorld.jar
```

You can delete a bucket and all of its folders and files with the following command:

```
gcloud storage rm gs://bucket-name/ --recursive
```