A basic Spark/Python script
This post is part of my preparation series for the Cloudera CCA175 exam, “Certified Spark and Hadoop Developer”. It is intentionally concise, to serve me as a cheat sheet.
This post serves as a brief introduction to Spark with Python. You can follow along and paste these lines into a pyspark shell.
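If you are working outside the pyspark shell (which provides sc automatically), you need to create the SparkContext yourself. A minimal sketch for a standalone script, assuming Spark 2.x; the app name "gas-prices" is just a placeholder:
# Only needed outside the pyspark shell.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gas-prices").getOrCreate()
sc = spark.sparkContext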
# Create some data as an RDD:
# Each entry consists of a tuple (weekday, gas_price)
myRDD = sc.parallelize([
    ("Monday", 1.24),
    ("Tuesday", 1.42),
    ("Sunday", 1.33),
    ("Sunday", 1.21),
    ("Monday", 1.18)
])
You can now transform the gas price from Euro per liter to USD per gallon, assuming 1.18 USD per Euro and 0.265 gallons per liter. Map the values to a tuple: the first element is the converted price, the second a constant 1, which will be useful for the reduceByKey step later on:
myRDD_USD = myRDD.mapValues(lambda price: (price * 1.18 / 0.265, 1))
View your data set at any time by calling .collect() on an RDD:
myRDD_USD.collect()
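If my arithmetic is right, the output should look roughly like this (prices rounded here to two decimals):
# Approximate expected output:
# [('Monday', (5.52, 1)), ('Tuesday', (6.32, 1)), ('Sunday', (5.92, 1)),
#  ('Sunday', (5.39, 1)), ('Monday', (5.25, 1))]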
Now compute an RDD containing the average gas price per weekday and show its contents. The function you pass to reduceByKey takes two arguments, which are two values belonging to the same key (not two different keys, and not the key together with a value). It reduces the RDD so that, for each weekday, the first tuple element holds the sum of all prices and the second holds their count; a final mapValues then computes the actual average:
price_per_weekday = myRDD_USD \
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
    .mapValues(lambda x: x[0] / x[1])
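As an aside, the same average can be computed without the explicit (price, 1) bookkeeping by using aggregateByKey, which carries the running sum and count for you. This is a sketch equivalent to the pipeline above, not part of the original exercise:
# aggregateByKey takes a zero value plus two functions: one to fold a
# price into a (sum, count) accumulator within a partition, and one to
# merge accumulators across partitions.
avg_per_weekday = myRDD \
    .mapValues(lambda price: price * 1.18 / 0.265) \
    .aggregateByKey((0.0, 0),
                    lambda acc, price: (acc[0] + price, acc[1] + 1),
                    lambda a, b: (a[0] + b[0], a[1] + b[1])) \
    .mapValues(lambda s: s[0] / s[1])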
ppw_list = price_per_weekday.collect()
for day in ppw_list:
    print(day)
And your result is a Python list of tuples, each containing a weekday and its average gas price!
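For the sample data above, the printed averages should come out approximately as follows (rounded here to two decimals; the order of weekdays may vary, since reduceByKey gives no ordering guarantee):
# Approximate expected output (order may vary):
# ('Monday', 5.39)
# ('Tuesday', 6.32)
# ('Sunday', 5.66)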