Reading and writing data with Spark and Python
This post is part of my preparation series for the Cloudera CCA175 exam, “Certified Spark and Hadoop Developer”. It is intentionally concise, to serve me as a cheat sheet.
In this post, I’ll briefly summarize the core Spark functions necessary for the CCA175 exam. I also have a longer article on Spark available that goes into more detail and spans a few more topics.
The Spark context (often named sc) has methods for creating RDDs and is responsible for making RDDs resilient and distributed.
Reading data
This is how you would use Spark and Python to create RDDs from different sources:
Note that you cannot run this with your standard Python interpreter. Instead, you use spark-submit to submit it as a batch job, or call pyspark from the shell.
Other file sources include JSON, sequence files, and object files, which I won’t cover here.
Writing data
The RDD class has a saveAsTextFile method. However, this saves a string representation of each element. In Python, your resulting text file will contain lines such as (1949, 111).
If you want to save your data in CSV or TSV format, you can either use Python’s StringIO and csv modules (described in chapter 5 of the book “Learning Spark”), or, for simple data sets, just map each element (a vector) into a single string, e.g. like this:
Use metastore tables as input and output
Metastore tables store meta information about your stored data, such as the HDFS path to a table, column names and types. The HiveContext inherits from SQLContext and can find tables in the metastore:
To write data from Spark into Hive, do this:
These HiveQL commands, of course, also work from the Hive shell.
You can then load data from Hive into Spark with commands like:
To write data from Spark into Hive, you can also transform it into a DataFrame and use that class’s write method:
Hive tables, by default, are stored in the warehouse at /user/hive/warehouse. This directory contains one folder per table, each of which stores the table’s data as a collection of text files.
Interacting with the Hive Metastore
This requirement for the CCA175 exam is a fancy way of saying “create and modify Hive tables”. The metastore holds meta information about your tables, i.e. the number of columns and their names and types.
So the requirement here is to get familiar with the CREATE TABLE and DROP TABLE commands from SQL. (ALTER TABLE does not work from within Spark and should be done from beeline.)