Cassandra

Cassandra is a NoSQL technology that favors availability over consistency, and is thus an eventually consistent database. There is no master node in a Cassandra database: Every node has the same software installed and provides the same service. They communicate with each other to handle data partitioning and backups.

You can tune your read requests regarding consistency: You’d send your request and additionally require that, e.g., at least 3 nodes agree on that value before it’s returned.

Cassandra provides a query language called CQL. Really, it’s just an API for doing reads and writes based on primary keys. You can’t do join operations, so there is no way to do lookups between two tables. This means your data must be stored de-normalized. All operations must be performed on a primary key.

Databases in Cassandra are called keyspaces, which in turn consist of tables. This is just a terminology thing, though.

The Company DataStax provides a free connector between Spark and Cassandra. It transforms Cassandra tables to DataFrames and back, which enables you to do large-scale, complex analyses and/or data manipulations and transactions using Spark.

Using CQL

This is a short example that creates a keyspace (i.e., a database) named movielens in your Cassandra installation and creates a table:

[root@sandbox root]# cqlsh --cqlversion="3.4.0"
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.0.9 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE movielens WITH replication = {'class': 'SimpleStrategy', 'replication_factor':'1'} AND durable_writes = true;
cqlsh> USE movielens;
cqlsh:movielens> CREATE TABLE users (user_id int, age int, gender text, occupation text, zip text, PRIMARY KEY (user_id));
cqlsh:movielens> SELECT * FROM users;
 
 user_id | age | gender | occupation | zip
---------+-----+--------+------------+-----
 
(0 rows)
cqlsh:movielens>

You have now created a table called users in Cassandra!

Next, we can populate it with the “users” table from the ml-100k data set. To do this, one way is via Spark, with the following Python script:

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions
 
def parseInput(line):
    fields = line.split('|')
    return Row(user_id = int(fields[0]), age = int(fields[1]), gender = fields[2], occupation = fields[3], zip = fields[4])
 
if __name__ == "__main__":
    # Create a SparkSession
    spark = SparkSession.builder.appName("CassandraIntegration").config("spark.cassandra.connection.host", "127.0.0.1").getOrCreate()
 
    # Get the raw data
    lines = spark.sparkContext.textFile("hdfs:///user/maria_dev/ml-100k/u.user")
    # Convert it to a RDD of Row objects with (userID, age, gender, occupation, zip)
    users = lines.map(parseInput)
    # Convert that to a DataFrame
    usersDataset = spark.createDataFrame(users)
 
    # Write it into Cassandra
    usersDataset.write\
        .format("org.apache.spark.sql.cassandra")\
        .mode('append')\
        .options(table="users", keyspace="movielens")\
        .save()
 
    # Read it back from Cassandra into a new Dataframe
    readUsers = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="users", keyspace="movielens")\
    .load()
 
    readUsers.createOrReplaceTempView("users")
 
    sqlDF = spark.sql("SELECT * FROM users WHERE age < 20")
    sqlDF.show()
 
    # Stop the session
    spark.stop()

You can execute this script with the following command:

spark-submit --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11 CassandraSpark.py

The result should be (after a lot of messages and debug output) the first 20 rows of all users under 20, but extracted from Cassandra!

+-------+---+------+-------------+-----+
|user_id|age|gender|   occupation|  zip|
+-------+---+------+-------------+-----+
|    588| 18|     F|      student|93063|
|     30|  7|     M|      student|55436|
|    528| 18|     M|      student|55104|
|    674| 13|     F|      student|55337|
|    375| 17|     M|entertainment|37777|
|    851| 18|     M|        other|29646|
|    859| 18|     F|        other|06492|
|    813| 14|     F|      student|02136|
|     52| 18|     F|      student|55105|
|    397| 17|     M|      student|27514|
|    257| 17|     M|      student|77005|
|    221| 19|     M|      student|20685|
|    368| 18|     M|      student|92113|
|    507| 18|     F|       writer|28450|
|    462| 19|     F|      student|02918|
|    550| 16|     F|      student|95453|
|    289| 11|     M|         none|94619|
|    521| 19|     M|      student|02146|
|    281| 15|     F|      student|06059|
|     36| 19|     F|      student|93117|
+-------+---+------+-------------+-----+
only showing top 20 rows

Cassandra

Using CQL

My projects

Categories