The Yhat Blog


machine learning, data science, engineering


ScienceCluster Meets Spark

by Elise Breda |


Getting Started

When did all the 'big data' hoopla start? By the very first definition, in a 1997 paper by scientists at NASA, a data set that is too big to fit on a local disk has officially graduated to big-data-dom.

Whether you're working with large excel files or processing the "10 terabytes generated by a 737 every 30 minutes during flight", having the right tool(s) for the job is extremely important. This is especially true when dealing with large, distributed datasets.

Enter Apache Spark.

What is Apache Spark?

When you have a job that is too big to process on a laptop or single server, Spark enables you to divide that job into more manageable pieces. Spark then runs these job pieces in-memory, on a cluster of servers to take advantage of the collective memory available on the cluster.

Apache Spark is an open source processing engine originally developed at UC Berkeley in 2009.

Spark is able to do this thanks to its Resilient Distributed Dataset (RDD) application programming interface (or API). If you want to know more about what happens under the Spark hood, check out this handy Cloudera guide on how Spark’s RDD API and the original Apache Mapper and Reducer API differ.

Spark has gained a lot of popularity in the big data world recently due to its lightning-fast computing speed and its wide array of libraries, including SQL and DataFrames, MLlib, GraphX and Spark Streaming. Given how useful and efficient Spark is for interactive queries and iterative big data processing, we decided it was time to invite Spark to the ScienceCluster party.

Welcome to the Party: Spark in ScienceCluster

ScienceCluster is Yhat’s enterprise workplace for data science teams to collaborate on projects and harness the power of distributed computing to allocate tasks across a cluster of servers. Now that we’ve added support for Spark to ScienceCluster, you can work either interactively or by submitting standalone Python jobs to Spark.


ScienceCluster is a data science platform developed by Yhat (that's us).

Working interactively means launching a Spark shell in Jupyter in order to prototype or experiment with Spark algorithms. If and when you decide to run your algorithm on your team’s Spark cluster, you can easily submit a standalone job from your Jupyter notebook and monitor its progress, without ever leaving ScienceCluster.

Interactive Spark Analysis with Bill Shakespeare

For this example, suppose that you are interested in performing a word count on a text document using spark interactively. You’ve selected to run this on the entire works of William Shakespeare, the man responsible for giving the English language words like swagger, scuffle and multitudinous, among others.

Shakespeare's plays also feature the first written instances of the phrases "wild goose chase" and "in a pickle."

Begin by uploading the complete works of William Shakespeare to ScienceCluster. Next, launch the interactive spark shell that is integrated right into the Jupyter IDE.

Launching an interactive spark shell in the Jupyter IDE.

This interactive shell works by running a Python-Spark kernel inside a container that a Jupyter notebook can communicate with. Each instance of a Jupyter notebook running the Python-Spark kernel gets its own container.

Next, add the following code to your Jupyter notebook:

  from __future__ import print_function
  from operator import add

  # Use the SparkContext sc here, see below.
  lines = sc.textFile("shakespeare.txt")

  counts = lines.flatMap(lambda x: x.split(' ')) \
                .map(lambda x: (x, 1)) \
                .reduceByKey(add)

  output = counts.collect()

  # Print the first 100 word counts as pairs
  for (word, count) in output[:100]:
      print("%s: %i" % (word, count))

Note that you do not need to instantiate the SparkContext. The PySpark kernel will take care of this for you. You must use the sc object as the SparkContext in our notebook session. The path that you use to instantiate an RDD textFile is just the filename. After you run this code in your notebook, you see the following output:

Output:

: 506499
fawn: 11
Fame,: 3
mustachio: 1
protested,: 1
sending.: 3
offendeth: 1
instant;: 1
scold: 4
Sergeant.: 1
nunnery: 1
Sergeant,: 2
swoopstake: 1

Running Standalone Spark Jobs

Once you have perfected our Shakespeare word count algorithm, let’s say that you decide that it needs to be run on your team’s Spark cluster.

The first step is to convert our notebook code into a standalone Python script. To convert your Shakespeare algorithm, you might borrow the wordcount.py file below from Spark’s examples.

from __future__ import print_function
import sys
from operator import add
from pyspark import SparkContext


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(sys.argv[1], 1)
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    sc.stop()

Note that it is best practice to stop a SparkContext after your code is finished running.

Next, upload your standalone script to ScienceCluster project. To run your Spark job from ScienceCluster indicate that “This is a Spark Job” on your job form and complete the Spark submission fields.

Once you’ve completed the job form, send your standalone job to your Spark Cluster by clicking Run. Your Spark Cluster logs will now stream back to ScienceCluster, accessible via the Running Jobs tab. You can either monitor your job in ScienceCluster to see when it is finished, or opt in to an email alert. Either way, rest you merry as Spark processes your Shakespearian query!

Running a Spark job on ScienceCluster

Tell Me More About ScienceCluster

ScienceCluster is an enterprise workplace for data science teams to collaborate on projects and harness the power of distributed computing to allocate tasks across a cluster of servers. Making it easy to work with Spark in ScienceCluster is just one of the ways that we’re empowering businesses to draw value from data science.

Reach out if you want to chat with us or head to our home page or products page if you want or learn more!



Our Products


Rodeo: a native Python editor built for doing data science on your desktop.

Download it now!

ScienceOps: deploy predictive models in production applications without IT.

Learn More

Yhat (pronounced Y-hat) provides data science solutions that let data scientists deploy and integrate predictive models into applications without IT or custom coding.