NoSQL: non-relational databases
NoSQL is a way of quickly answering simple questions on big data sets. Things such as Amazon asking what articles a certain customer bought in the last year are simply not manageable with a standard relational data base management system (RDBMS) like MySQL. A join operation on customers and orders, for example, is an expensive operation in a large database. In a RDBMS, you would have to apply some shady tricks to manage these tasks, for example denormalization of your data, i.e. storing the customer’s name and address redundantly in your orders table.
Maybe you don’t need the rich query language that SQL offers. More often than not, you just need a simple API that allows you to get or put some information (key/value data stores). In that case, you don’t need a relational database, but a more scalable system that can be horizontally scaled by key ranges.
And again, for more complex queries, you can still use Hive, Spark, and all the other tools. A NoSQL database, however, enables you to perform high-throughput transactions, e.g. storing all customer transactions of a large online shop.
It boils down to this: Use the right tool for the job. For big analytic queries like business reports, demographics, etc., you can use Hive, Pig, Spark, Tableau, and all these high-level tools. For smaller data sets, MySQL might be a better choice for storing this data. Only if you work at giant scale, e.g. serving transaction data to a web application, then it makes sense to export your data to a non-relational database.
There are many well-known NoSQL technologies, the most popular being:
- HBase is probably the most relevant one, since it’s a direct part of Hadoop. It’s built on top of HDFS and is a very fast and scalable transactional system to query your data and expose it to a webservice, a web app etc.
- Cassandra is cleverly designed in that it has no master node, i.e. no single point of failure. The data model is similar to BigTable and HBase. Cassandra is non-relational, so you still won’t do joins or normalizing data structures, but it does have a limited CQL language as an interface.
- MongoDB is the most popular NoSQL tool right now. It’s a document oriented database, storing its data in JSON format. Its advantages include strong consistency and an expressive query language.
The CAP theorem states that you can only have two out of the following three in a distributed data store:
- Partition-tolerance, i.e. the database can easily be split up and distributed across a cluster
- Availability, i.e. your database should be available all the time, even if, say, one of the nodes burns down.
- Consistency, i.e. when you write new data, it is immediately available to all nodes. This is critical for bank accounts, e.g., where your balance must be the same for each ATM. But in other cases, e.g. if you post something to Facebook, it is not a critical requirement that everyone sees it right away. The usual span until consistency is reached is in the range of seconds.
RDMBS like MySQL usually choose availability and consistency, and thus are not partition-tolerant. Now, in NoSQL, partition-tolerance is a non-negotiable requirement. So your choice is between availability and consistency. Apache’s HBase prefers consistency to availability. It has a single master node, that provides consistency across the nodes. But if that master node goes down, you have a problem. Some other NoSQL databases favor availability over consistency, which is why the eventual consistency thing exists. Cassandra, for example, has no master node, which provides high availability, but at the cost of immediate consistency across the nodes.
Choosing which database to use
This decision is made based on many things to be considered.
First, you should decide whether or not a non-relational database comes into question. If you have static data, you can just import it into HDFS an analyze it offline, e.g. with Spark. A NoSQL database only makes sense as soon as your data updates very frequently, or if you want to get answers to specific queries over and over again, with a very high frequency (e.g. from a web service). To summarize: If you do analytics internally, don’t set up a NoSQL database. If you want to serve the results of many queries to many people (think Google Analytics), then use a NoSQL database.
If you need a database, you should think about:
- Integration considerations: What tools already exist on your cluster? How easy would it be to connect the database to the existing technology? If you have a lot of existing SQL code, can the database understand SQL-type queries?
- Scaling requirements: How much data are we talking about? If your data is small enough to fit on one computer, a relational database is the better choice in many cases. How many transactions per second are happening?
- Support considerations: What in-house expertise do you have available? What are your security requirements? Does this database offer professional, paid support? Do I want to outsource the administration?
- Where on the CAP theorem do you find yourself and your use case’s requirements? Can you let go of availability, consistency, or partition-tolerance? Also note that the CAP theorem is no hard and fast rule anymore. For example, you can tune Cassandra to only return some value if a lot (say, 50%) of the nodes agree on the value.
- Simplicity: Don’t go for a giant NoSQL database if your dataset is small enough to fit into a RDBMS. To me, this is the guiding principle in software design. It applies here, in choosing an infrastructure, as well.