Using Sqoop to move data between HDFS and MySQL

This post is part of my preparation series for the Cloudera CCA175 exam, “Certified Spark and Hadoop Developer”.

Check MySQL

Before using Sqoop to import data from MySQL, you should connect to the database and make sure everything exists as expected. From the command line, access a MySQL on localhost via

mysql -h localhost -u root -p

You could enter a database name at the end, but you don’t have to. Once inside, you can poke around and check the database contents:

show databases;  -- list all available databases
use retail_db;   -- switch to this database
show tables;
select * from customers limit 10;  -- query the table

Sqoop needs permissions to access the tables. You grant these permissions from MySQL like so:

GRANT ALL PRIVILEGES ON <database_name>.* to ''@'localhost';

Import from MySQL to HDFS

The core command here is sqoop import, along with a lot of parameters. This is an extensive example command:

sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/<database_name>" \
  --username <user> \
  --password <pass> \
  --table <table_name> \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --target-dir /user/cloudera/problem1/<target_folder> \
  --as-avrodatafile;

Additionally to importing a table, this command compresses the HDFS file using the snappy codec (alternatives to .SnappyCodec are BZip2Codec, GzipCodec, and DefaultCodec).

It also imports the file as an Avro file (instead of the default, a text file).

Advanced import commands

--fields-terminated-by '|' changes the field delimiter from the default \t to another one.
--hive-import can import the data directly into the Hive warehouse (instead of HDFS, the default). Use --hive-overwrite to replace an existing table.
--driver is only relevant if the connection string does not begin with jdbc:mysql://.
--null-non-string -99 to recode NULL values from the database into -99 for non-string data types.
--null-string "NA" the same option for strings
--where "product_id > 10000" can select a subset from the table that gets imported

Export from HDFS to MySQL

To export a table, you must first create it in MySQL. Log into it as before, then create the table in your target database:

CREATE TABLE <database_name>.<table_name>(
  variable_1 varchar(255) not null,
  variable_2 date not null,
  variable_3 int,
  variable_4 numeric,
  constraint pk_order_result primary key (variable_1, variable_2)
);

Then, a sample export command looks like this:

sqoop export \
  --table result \
  --connect "jdbc:mysql://quickstart.cloudera:3306/<database_name>" \
  --username <user> \
  --password <pass> \
  --export-dir "/user/cloudera/dataset/mainfolder" \
  --columns "variable_1,variable_2,variable_3,variable_4"

If you export from a Hive internal table, the export-dir is something like /user/hive/warehouse/<database_name>/<table_name>.

Check your MySQL database if the table was imported correctly!

Advanced export commands

--input-fields-terminated-by '\0001' is an option you set when exporting from Hive, since the default field delimiter is ASCII value 1 there.
--input-null-string and input-null-non-string to recode from the corresponding values to NULL in the database.