Cassandra is a NoSQL database that favors availability over consistency and is therefore eventually consistent. There is no master node in a Cassandra cluster: every node runs the same software and provides the same service. The nodes communicate with each other to handle data partitioning and replication.
You can tune consistency per read request: along with the request, you specify how many replicas must respond, e.g., at least 3 nodes have to agree on the value before it's returned.
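How many replicas to ask for is usually reasoned about with simple quorum arithmetic. The following is a minimal sketch of that arithmetic, not Cassandra's implementation; the function names are made up for illustration. It assumes a replication factor RF, reads that wait for R replicas, and writes that wait for W replicas. A read is guaranteed to overlap with the latest acknowledged write when R + W > RF.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True if every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


# With 3 replicas, quorum reads plus quorum writes always overlap,
# while reading and writing a single replica each may miss the latest value.
rf = 3
q = quorum(rf)
print(q, read_sees_latest_write(q, q, rf), read_sees_latest_write(1, 1, rf))
```

Requiring more replicas per read makes stale results less likely, but the request then fails or waits when too few replicas are reachable, which is the availability side of the trade-off.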
Cassandra provides a query language called CQL. Really, it's just an API for doing reads and writes based on primary keys. You can't do join operations, so there is no way to do lookups between two tables. This means your data must be stored denormalized, and every operation has to address a primary key.
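To make the "everything goes through the primary key" point concrete, here is a small sketch of a key lookup. The table and column names are hypothetical, and the `session` argument is assumed to behave like a session object from the DataStax Python driver, where `execute(query, parameters)` returns a result set with a `one()` method.

```python
# Hypothetical table and columns, chosen only for illustration.
SELECT_USER_BY_KEY = (
    "SELECT user_id, age, occupation FROM users WHERE user_id = %s"
)


def fetch_user(session, user_id):
    """Return the single row stored under the given primary key, or None."""
    # The WHERE clause names the primary key, so Cassandra can route the
    # request straight to the replicas that own this key.
    return session.execute(SELECT_USER_BY_KEY, (user_id,)).one()
```

With the driver installed (`pip install cassandra-driver`), a session could be obtained via `Cluster(["127.0.0.1"]).connect("movielens")`, assuming a node on localhost and an existing keyspace of that name.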
Databases in Cassandra are called keyspaces, which in turn consist of tables. This is just a terminology thing, though.
The company DataStax provides a free connector between Spark and Cassandra. It transforms Cassandra tables to DataFrames and back, which enables you to do large-scale, complex analyses and/or data manipulations and transactions using Spark.
Using CQL
This is a short example that creates a keyspace (i.e., a database) named movielens in your Cassandra installation and creates a table:
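A minimal sketch of such statements follows, wrapped in Python so they can be run with the DataStax Python driver. The column names are an assumption based on the layout of the u.user file in ml-100k (user id, age, gender, occupation, zip code), and SimpleStrategy with a replication factor of 1 is an assumption that only suits a single-node test installation.

```python
# CQL statements; column names follow the u.user file of ml-100k (assumed).
CREATE_KEYSPACE = (
    "CREATE KEYSPACE IF NOT EXISTS movielens "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
CREATE_USERS_TABLE = (
    "CREATE TABLE IF NOT EXISTS movielens.users ("
    "user_id int, age int, gender text, occupation text, zip text, "
    "PRIMARY KEY (user_id))"
)
SCHEMA_STATEMENTS = [CREATE_KEYSPACE, CREATE_USERS_TABLE]


def create_schema(contact_point="127.0.0.1"):
    """Run the statements above against a Cassandra node (assumed local)."""
    # Imported lazily so the statements can be inspected without the driver.
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    cluster = Cluster([contact_point])
    session = cluster.connect()
    try:
        for statement in SCHEMA_STATEMENTS:
            session.execute(statement)
    finally:
        cluster.shutdown()
```

The same two statements can also be pasted into the cqlsh shell, each terminated by a semicolon.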
You have now created a table called users in Cassandra!
Next, we can populate it with the “users” table from the ml-100k data set. One way to do this is via Spark, with the following Python script:
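What follows is a sketch of what such a script can look like, under a few assumptions: the input file is the '|'-separated u.user file of ml-100k, the target is a table users in the keyspace movielens with matching columns, Cassandra listens on 127.0.0.1, and the DataStax connector is on Spark's classpath. The path to u.user is passed as an argument because its location depends on your setup.

```python
import sys


def parse_user(line):
    """Turn one '|'-separated line of u.user into a dict matching the users table."""
    fields = line.split("|")
    return {
        "user_id": int(fields[0]),
        "age": int(fields[1]),
        "gender": fields[2],
        "occupation": fields[3],
        "zip": fields[4],
    }


def main(input_path, cassandra_host="127.0.0.1"):
    # Imported here so parse_user can be used without a Spark installation.
    from pyspark.sql import Row, SparkSession

    spark = (
        SparkSession.builder.appName("CassandraIntegration")
        .config("spark.cassandra.connection.host", cassandra_host)
        .getOrCreate()
    )

    # Parse the raw text file into a DataFrame.
    lines = spark.sparkContext.textFile(input_path)
    users = spark.createDataFrame(lines.map(lambda l: Row(**parse_user(l))))

    # Write the DataFrame into the Cassandra table movielens.users.
    (
        users.write.format("org.apache.spark.sql.cassandra")
        .mode("append")
        .options(table="users", keyspace="movielens")
        .save()
    )

    # Read the table back from Cassandra and query it with Spark SQL.
    read_users = (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(table="users", keyspace="movielens")
        .load()
    )
    read_users.createOrReplaceTempView("users")
    # show() prints at most the first 20 rows by default.
    spark.sql("SELECT * FROM users WHERE age < 20").show()

    spark.stop()


# Only start the Spark job when a path to u.user is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

The script first writes the users into Cassandra and then reads them back, so the final query runs on data that has made the round trip through the database.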
You can execute this script with the following command:
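The command is a spark-submit call that pulls in the connector with the `--packages` option. The sketch below builds that command line; the script name and the connector coordinate are placeholders (assumptions), since the coordinate has to match your Spark and Scala versions.

```python
import shlex
import subprocess


def spark_submit_command(script, connector_package, script_args=()):
    """Build the spark-submit argument list that loads the Cassandra connector."""
    return [
        "spark-submit",
        "--packages", connector_package,  # Maven coordinate of the connector
        script,
        *script_args,
    ]


def run(script, connector_package, script_args=()):
    """Echo the command line, then execute it and fail loudly on errors."""
    command = spark_submit_command(script, connector_package, script_args)
    print(shlex.join(command))
    return subprocess.run(command, check=True)
```

Calling `run("CassandraSpark.py", "<group>:<artifact>:<version>", ["<path to u.user>"])` would then launch the job, with the placeholders replaced by your script name, the connector coordinate for your installation, and the location of the data file.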
After a lot of log messages and debug output, the result should be the first 20 rows of all users younger than 20, this time read back from Cassandra!