Lists

>>> x = [2, 4, 8, 16, 32, 64, 128, 256]
>>> x[3:]
[16, 32, 64, 128, 256]
>>> x[1:6:2]
[4, 16, 64]
>>> x.append(512)
>>> x
[2, 4, 8, 16, 32, 64, 128, 256, 512]

Dictionaries

>>> # Named translations rather than dict, to avoid shadowing the built-in dict type
>>> translations = {'car': 'Auto', 'boat': 'Boot', 'boot': 'Stiefel'}
>>> translations
{'car': 'Auto', 'boat': 'Boot', 'boot': 'Stiefel'}
>>> translations.items()
dict_items([('car', 'Auto'), ('boat', 'Boot'), ('boot', 'Stiefel')])
>>> translations[1]  # lookup is by key, not by position
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 1
>>> translations['car']
'Auto'
>>> translations.keys()
dict_keys(['car', 'boat', 'boot'])
>>> translations.values()
dict_values(['Auto', 'Boot', 'Stiefel'])
>>> 'car' in translations
True
>>> 'Auto' in translations  # the in operator tests keys, not values
False
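
The session above shows that indexing with a missing key raises KeyError and that in only tests keys. A minimal sketch of two related idioms follows: get for lookups that may fail, and values() for testing membership among the values. The variable names and the 'plane' key are illustrative assumptions and do not come from the original example.

```python
# Same example data as above, under an illustrative variable name.
translations = {'car': 'Auto', 'boat': 'Boot', 'boot': 'Stiefel'}

# get returns None instead of raising KeyError when the key is absent.
missing = translations.get('plane')

# A second argument supplies a default value for absent keys.
fallback = translations.get('plane', 'unknown')

# To test membership among the values, search values() explicitly.
has_value = 'Auto' in translations.values()

print(missing, fallback, has_value)
```

Note that searching values() scans the values one by one, whereas a key lookup uses the hash table.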

Modules

  • Packages (e.g. pyspark) contain modules (e.g. pyspark.sql), which in turn define classes, functions, etc. (e.g. pyspark.sql.HiveContext(SparkContext)).
  • Load a package or a module by issuing import numpy. You can then access its functions by numpy.array([2,4,6]). If you want to type less, you can import numpy as np, which allows you to call np.array([2,4,6]).
  • You can hand-pick single functions with from numpy import array and can then directly call array([2,4,6]), but this is discouraged as it could lead to namespace clashes.
  • from numpy import * imports every public name defined in numpy directly into your namespace. This is not recommended: it makes namespace clashes even more likely and hides where a name came from.
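
The import styles above can be compared side by side. The sketch below uses the standard-library math module as a stand-in for numpy, purely so that it runs without third-party packages; the variable names are illustrative. It also shows one concrete namespace clash: importing pow from math hides Python's built-in pow.

```python
import builtins

# Style 1: import the module and use fully qualified names.
import math
root_full = math.sqrt(16)

# Style 2: import the module under a shorter alias.
import math as m
root_alias = m.sqrt(16)

# Style 3: import a single name directly into the current namespace.
from math import sqrt
root_direct = sqrt(16)

# Namespace clash: this import shadows the built-in pow function.
from math import pow
clash_result = pow(2, 3)             # math.pow always returns a float
builtin_result = builtins.pow(2, 3)  # the built-in keeps integers as integers

print(root_full, root_alias, root_direct)
print(type(clash_result).__name__, type(builtin_result).__name__)
```

With the qualified styles (1 and 2) the origin of every name stays visible at the call site, which is why they are the safer default.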

Functional programming

The following two versions are equivalent. Both pass a function to map, which applies it to every element:

# Both versions assume that rdd is an existing Spark RDD of numbers.

# Version 1: Anonymous function
rdd.map(lambda x: x*x)

# Version 2: Named function
def squareIt(x):
    return x*x

rdd.map(squareIt)
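
The equivalence can be checked without a Spark installation. The following sketch swaps in Python's built-in map over a plain list as a stand-in for rdd.map; the list numbers and the function name square_it are assumptions made for illustration, not part of the Spark API.

```python
# A plain list stands in for the RDD in this sketch.
numbers = [1, 2, 3, 4]

# Named function, equivalent to the lambda used below.
def square_it(x):
    return x * x

# The built-in map is lazy, so list() forces the evaluation.
squares_lambda = list(map(lambda x: x * x, numbers))
squares_named = list(map(square_it, numbers))

print(squares_lambda)
print(squares_lambda == squares_named)
```

A lambda is convenient for one-off expressions, while a named function can be reused, documented, and tested on its own.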