Hive
Hive can make your Hadoop cluster look like a relational database. It is layered on top of MapReduce or Tez and translates your SQL queries into MapReduce or Tez jobs.
The language itself is called HiveQL. It is very similar to MySQL's SQL dialect, with some extensions (e.g., views). Hive optimizes query execution, and it is very extensible through things like user-defined functions and Thrift servers; it also exposes JDBC/ODBC drivers, so that it can look like any other database to client applications.
Hive also has a few disadvantages: it is not useful for OLTP (i.e., high-throughput INSERT, UPDATE, or DELETE transactions) because each query pays the startup overhead of a MapReduce job. Also, SQL is a limited language; Pig and Spark are more appropriate for more complicated tasks.
Importing tables
If you have a data file on your hard disk (for example a .csv file) or on HDFS, you can import it into a table in one of two ways: through a UI such as Ambari, which is more convenient, or through SQL commands. Just for illustration purposes, this is how you would import the ml-100k data set from HDFS into Hive:
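A minimal sketch of such an import could look like the following (the table layout matches ml-100k's tab-separated u.data file of user ID, movie ID, rating, and timestamp; the table name and HDFS path are illustrative assumptions, so adjust them to your setup):

```sql
-- Illustrative sketch: a table matching the u.data layout of ml-100k.
CREATE TABLE IF NOT EXISTS ratings (
    userID INT,
    movieID INT,
    rating INT,
    rating_time INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Move the file from its HDFS location into Hive's warehouse directory.
-- The path below is an assumption for illustration.
LOAD DATA INPATH '/user/maria_dev/ml-100k/u.data'
OVERWRITE INTO TABLE ratings;
```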
The command LOAD DATA LOCAL (as opposed to LOAD DATA) will copy the data from your local file system into Hive. For actual big data sets, you may prefer LOAD DATA, which moves the files within HDFS instead of copying them. Either way, the data is then managed by Hive, so a DROP TABLE command will delete your data; be careful!
If you want Hive to work with data, but not “own” the data, you can create an external table (as opposed to a managed table). Hive stores the metadata (e.g., column names and types), but the data stays on HDFS and remains accessible to other tools.
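An external table for the same ratings data might be sketched like this (the directory in LOCATION is an illustrative assumption; it must point to the folder holding the data files):

```sql
-- The EXTERNAL keyword tells Hive not to take ownership of the files.
-- Dropping this table removes only Hive's metadata; the data stays in HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS ratings_ext (
    userID INT,
    movieID INT,
    rating INT,
    rating_time INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/maria_dev/ml-100k/';
```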
Querying tables
I previously said that SQL is a somewhat limited tool for data analysis. But you can use views to store the result of a query in an intermediate, virtual table. In subsequent SQL statements, you can then access that view the same way you would access a “real” table. For example, in this script you generate the view topMovieIDs, which contains each movie ID and its number of ratings; a subsequent statement then joins the movie name to it and selects only the name and the rating count for your final result:
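Such a script might be sketched as follows (a hedged reconstruction, not the exact original: the table names ratings and names, and the columns movieID, title, and ratingCount, are assumptions about the schema):

```sql
-- View with each movie ID and its rating count, most-rated first.
CREATE VIEW IF NOT EXISTS topMovieIDs AS
SELECT movieID, COUNT(movieID) AS ratingCount
FROM ratings
GROUP BY movieID
ORDER BY ratingCount DESC
LIMIT 10;

-- Join the movie names to the view and keep only name and count.
SELECT n.title, t.ratingCount
FROM topMovieIDs t JOIN names n ON t.movieID = n.movieID;

-- Views persist like tables, so clean up afterwards if desired.
DROP VIEW topMovieIDs;
```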
You can run this script from the Ambari UI, or straight from the command line. On the command line you get some diagnostic output, followed by the result: the top 10 movies according to the number of ratings. Not surprisingly, Star Wars is the number one here.
If you run this from the Ambari UI, you get an HTML-formatted table as your output, but you can also download the results as a .csv table. This is useful for further analysis or visualization of your results.
Other nice things about Hive
- You can assign a partitioning to your data if you frequently analyze only specific partitions. If, for example, you have shopping data from many business units, and most of your analyses are specific to a single business unit, partitioning by business unit provides a significant speed-up, because queries only scan the partitions they need.
- You can have structures as column types. For example, to have address as a structure containing street, city, and zip code, you would declare the column as address STRUCT<street:STRING, city:STRING, zip:STRING> in your CREATE TABLE statement. The fields are then accessed as address.city etc.
- You can execute a batch Hive script from the command line with hive -f /some/path/queries.hql
- Importing from and exporting to other relational databases such as MySQL, Oracle, or PostgreSQL works with Sqoop, another tool in the Hadoop ecosystem. If your analyzed and processed output is not “big” anymore, you can export it to a standalone database for other people to work with.
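To make the first two points above concrete, here is a hedged sketch of a table that combines a partition column with a STRUCT-typed column (the customers schema and business_unit partition are invented for illustration):

```sql
-- Each distinct business_unit value becomes its own partition
-- (a subdirectory in HDFS), so queries filtering on it scan less data.
CREATE TABLE IF NOT EXISTS customers (
    name STRING,
    address STRUCT<street:STRING, city:STRING, zip:STRING>
)
PARTITIONED BY (business_unit STRING);

-- Struct fields are addressed with dot notation; the WHERE clause
-- restricts the scan to a single partition.
SELECT name, address.city
FROM customers
WHERE business_unit = 'retail';
```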