With Spark, you can write very flexible data-processing scripts. It provides APIs in Java, Python, and Scala. Writing in Python is easier, but the programs will be slower and a little less reliable. Spark itself is written in Scala, so that language is generally recommended. The Spark API in Scala looks quite similar to the Python one, so the transition is not too hard. Spark's flexibility allows you to run advanced analyses, such as iterative algorithms and interactive analyses.
The most important difference from other technologies such as Hive is that Spark has a rich ecosystem built on top of it, such as Spark Streaming (for real-time data), Spark SQL (an SQL interface for writing queries and functions), MLLib (a library for running distributed versions of machine learning algorithms on a dataset in Spark), and GraphX (for analyzing graphs, for example social networks).
Spark can run on Hadoop, but it doesn't have to; you can also choose other cluster managers to run Spark on.
Spark is very fast for two main reasons. First, it is memory-based rather than disk-based: it tries to keep as much as possible in RAM, even between jobs. The second factor that contributes to its 10-100x speedup over standard Hadoop MapReduce is its use of directed acyclic graphs (more on this later) to optimize workflows.
Resilient Distributed Datasets (RDDs)
The central concept in Spark is the Resilient Distributed Dataset, or RDD, an abstraction for a large dataset. In a typical Spark application, you load one or more RDDs, then perform transformations on them to create a target RDD, which you then store on your file system.
You can create RDDs to experiment with from the Spark shell, or load files from your file system, HDFS, or HBase. You can even load data from Amazon's S3 service with the s3n:// protocol. RDDs can also be created from any database that is accessible via JDBC.
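For illustration, here is a sketch of a few ways to create RDDs in PySpark (the paths and names are placeholders of mine, not from the original post):

```python
# Sketch: a few ways to create RDDs (paths are placeholders).
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("CreateRDDs")
sc = SparkContext(conf=conf)  # in the Spark shell, sc already exists

# From an in-memory collection, handy for quick experiments
numbers = sc.parallelize([1, 2, 3, 4, 5])

# From a local file or HDFS, one RDD element per line
lines = sc.textFile("hdfs:///user/maria_dev/ml-100k/u.data")

# From S3, using the s3n:// protocol mentioned above
# logs = sc.textFile("s3n://my-bucket/logs/2016-01-01.txt")
```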
Transforming RDDs can be done through operations such as map, flatMap, filter, distinct, sample, or set operations like union. These look similar to Pig, but Spark is a lot more powerful.
Reducing RDDs can be done by so-called actions, for example collect, count, countByValue, take or reduce.
Spark uses lazy evaluation, so it only executes a query when it hits an action. Then, it can go backward and work out the most efficient way of wrangling the data.
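To make this concrete, here is a small sketch of how transformations, actions, and lazy evaluation play together (the numbers are made up for illustration):

```python
# Sketch: transformations are lazy, actions trigger execution.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("LazyEvalDemo"))

numbers = sc.parallelize([1, 2, 2, 3, 4, 5, 6])

# Transformations: nothing runs yet, Spark only records the lineage.
evens = numbers.filter(lambda x: x % 2 == 0)
squared = evens.map(lambda x: x * x)

# Actions: now Spark builds the DAG, optimizes it, and executes the job.
print(squared.collect())       # [4, 4, 16, 36]
print(squared.count())         # 4
print(numbers.countByValue())  # counts per distinct value, e.g. 2 appears twice
```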
An example Spark 1.0 script
Here is a Spark script (in Python) that analyzes the MovieLens dataset I worked with in the other posts here. It aggregates the average ratings for each movie and prints out the 10 worst rated movies.
Note that this is the Spark 1 way of doing it. A script in Spark 2 will be a bit simpler.
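Here is a sketch of what such a Spark 1 script looks like (a reconstruction rather than the exact original; it assumes the MovieLens u.data file on HDFS and the u.item file in the local working directory, in their standard formats):

```python
# Sketch of a Spark 1 (RDD-based) script; file locations are assumptions.
import io
from pyspark import SparkConf, SparkContext

def loadMovieNames():
    """Map movieID -> title from the pipe-separated u.item file."""
    movieNames = {}
    with io.open("ml-100k/u.item", "r", encoding="ISO-8859-1") as f:
        for line in f:
            fields = line.split('|')
            movieNames[int(fields[0])] = fields[1]
    return movieNames

def parseInput(line):
    """u.data is tab-separated: userID, movieID, rating, timestamp."""
    fields = line.split()
    # (movieID, (rating, 1.0)) so we can sum ratings and counts per movie
    return (int(fields[1]), (float(fields[2]), 1.0))

if __name__ == "__main__":
    conf = SparkConf().setAppName("WorstMovies")
    sc = SparkContext(conf=conf)

    movieNames = loadMovieNames()
    lines = sc.textFile("hdfs:///user/maria_dev/ml-100k/u.data")

    # (movieID, (sumOfRatings, numberOfRatings))
    ratingTotalsAndCount = lines.map(parseInput).reduceByKey(
        lambda a, b: (a[0] + b[0], a[1] + b[1]))

    # (movieID, averageRating)
    averageRatings = ratingTotalsAndCount.mapValues(lambda x: x[0] / x[1])

    # Sort ascending by average rating and take the 10 worst
    worstMovies = averageRatings.sortBy(lambda x: x[1]).take(10)

    for movieID, avgRating in worstMovies:
        print(movieNames[movieID], avgRating)
```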
To run it, don’t invoke Python directly, but use the spark-submit command:
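Assuming the script above was saved as LowestRatedMovies.py (the filename is just an example):

```bash
spark-submit LowestRatedMovies.py
```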
This should give you some log output, followed by the 10 worst-rated movies in this dataset:
I loved “3 Ninjas” as a kid!
Spark 2.0, Spark SQL, DataFrames and DataSets
Spark 2.0 extends RDDs to a “DataFrame” object. These DataFrames contain Row objects, which carry column names and types. You can run SQL queries (the Hadoop community seems to love SQL), read and write to JSON or Hive, and communicate with JDBC/ODBC and even Tableau.
In Python, you can read a DataFrame from Hive via a HiveContext(), or from a JSON file via spark.read.json(), and then run SQL commands on it.
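For example (the JSON path, view name, and column names are placeholders of mine):

```python
# Sketch: reading a DataFrame from JSON and querying it with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

# Read a JSON file (one JSON object per line) into a DataFrame.
people = spark.read.json("hdfs:///user/maria_dev/people.json")

# Register it as a temporary view so it can be queried with SQL.
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()
```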
You can even perform many operations directly in Python, for example by issuing myDataFrame.groupBy(myDataFrame["movieID"]).mean(). This computes the column means per movie ID. Very handy, isn’t it?
To convert a DataFrame back to its underlying RDD structure, simply access myDataFrame.rdd and then work on the RDD level, e.g. myDataFrame.rdd.map(myMapperFunction).
A DataFrame is really a DataSet of Row objects. The more general DataSet class can hold objects other than Rows. This allows Spark to know more about the data structure upfront, and provides advantages such as compile-time (as opposed to run-time) errors when writing in Java or Scala. In Spark 2.0, it is recommended to use DataSets instead of DataFrames where possible. In addition to providing some speed-ups, they are the unified API between the subsystems of Spark 2 (e.g. the machine learning library).
Extending Spark SQL with User-defined functions (UDFs)
It is possible to write Python functions and use them within SQL calls! This example squares a column while extracting it:
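A sketch of what that can look like (the ratings table and its columns are placeholders of mine):

```python
# Sketch: registering a Python UDF and using it inside a SQL query.
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("UDFDemo").getOrCreate()

# A tiny example DataFrame; in the blog post this would be the ratings data.
ratings = spark.createDataFrame([(1, 50, 5), (1, 172, 5), (1, 133, 1)],
                                ["userID", "movieID", "rating"])
ratings.createOrReplaceTempView("ratings")

# Register a plain Python function as a SQL UDF.
spark.udf.register("square", lambda x: x * x, IntegerType())

# Use the UDF while extracting the column.
spark.sql("SELECT movieID, square(rating) AS ratingSquared FROM ratings").show()

spark.stop()
```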
Revisiting the worst rated movies in Spark 2.0
Coming back to the Spark 1 script from above, we’ll now reimplement it in Spark 2.0.
Instead of a SparkContext, you now create a SparkSession, which encompasses both a Spark context and a SQL context. The parseInput() function no longer returns a plain tuple, but a Row object that has column names and types. The line that creates averageRatings is very elegant and shows the advantage of Spark 2.0 over raw RDDs here:
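Roughly, the Spark 2 version looks like this (again a sketch rather than the exact original; the HDFS path is an assumption):

```python
# Sketch of the Spark 2 (DataFrame-based) version; the file path is an assumption.
from pyspark.sql import SparkSession, Row

def parseInput(line):
    fields = line.split()
    return Row(movieID=int(fields[1]), rating=float(fields[2]))

if __name__ == "__main__":
    spark = SparkSession.builder.appName("WorstMovies").getOrCreate()

    lines = spark.sparkContext.textFile("hdfs:///user/maria_dev/ml-100k/u.data")
    ratings = spark.createDataFrame(lines.map(parseInput))

    # Average rating per movie: the "very elegant" line mentioned above.
    averageRatings = ratings.groupBy("movieID").avg("rating")

    # Number of ratings per movie, then join the two DataFrames.
    counts = ratings.groupBy("movieID").count()
    averagesAndCounts = counts.join(averageRatings, "movieID")

    # The ten lowest-rated movies, by average rating.
    for movie in averagesAndCounts.orderBy("avg(rating)").take(10):
        print(movie)

    spark.stop()
```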
We also computed the rating counts and then joined the two tables together. This was done from Python, but we could have also used SQL commands. The output now shows you the average rating of the worst movies, plus the number of people who rated each movie.
Using MLLib on Spark
The previous examples are rather simple and could have been implemented in Pig or Hive as well. The power of Spark shows with more complex analyses. The machine learning library (MLLib) that’s built on top of Spark enables you to do exactly these complex analyses. Implementing something like alternating least squares (ALS) in pure MapReduce would be insanely hard.
In the following script, you train the ALS algorithm on the movie ratings data set and create a “dummy user” with the user ID 0. He has rated Star Wars and The Empire Strikes Back (movie IDs 50 and 172) with 5 stars, and Gone With the Wind (movie ID 133) with 1 star. I used Hive from Ambari to insert these three rows:
Using this information, the following script then predicts his ratings for all the other movies.
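Here is a sketch of such a script using the ALS implementation from spark.ml (the HDFS path is an assumption, and the ratings data is assumed to already contain the three rows for the dummy user 0):

```python
# Sketch: predicting user 0's ratings with ALS from spark.ml.
from pyspark.sql import SparkSession, Row, functions
from pyspark.ml.recommendation import ALS

def parseInput(line):
    fields = line.split()
    return Row(userID=int(fields[0]), movieID=int(fields[1]),
               rating=float(fields[2]))

if __name__ == "__main__":
    spark = SparkSession.builder.appName("MovieRecommendations").getOrCreate()

    # Assumed to include the three ratings of the dummy user 0.
    lines = spark.sparkContext.textFile("hdfs:///user/maria_dev/ml-100k/u.data")
    ratings = spark.createDataFrame(lines.map(parseInput)).cache()

    # Train the ALS model on the ratings DataFrame.
    als = ALS(maxIter=5, regParam=0.01,
              userCol="userID", itemCol="movieID", ratingCol="rating")
    model = als.fit(ratings)

    # Build (userID=0, movieID) pairs for every movie in the data set
    # and let the model predict user 0's rating for each of them.
    candidates = (ratings.select("movieID").distinct()
                         .withColumn("userID", functions.lit(0)))
    predictions = model.transform(candidates)

    # Print the movies with the highest predicted rating for user 0.
    for row in predictions.orderBy(predictions.prediction.desc()).take(20):
        print(row)

    spark.stop()
```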
Don’t forget to export SPARK_MAJOR_VERSION=2 again, and run this Spark job as before. Here is the output of this script:
Very advanced stuff, and relatively simple to run thanks to MLLib!