Using Sqoop to move data between HDFS and MySQL
This post is part of my preparation series for the Cloudera CCA175 exam, “Certified Spark and Hadoop Developer”.
Check MySQL
Before using Sqoop to import data from MySQL, you should connect to the database and make sure everything exists as expected. From the command line, access a MySQL on localhost via
You could enter a database name at the end, but you don’t have to. Once inside, you can poke around and check the database contents:
Sqoop needs permissions to access the tables. You grant these permissions from MySQL like so:
Import from MySQL to HDFS
The core command here is sqoop import
, along with a lot of parameters. This is an extensive example command:
Additionally to importing a table, this command compresses the HDFS file using the snappy codec (alternatives to .SnappyCodec
are BZip2Codec
, GzipCodec
, and DefaultCodec
).
It also imports the file as an Avro file (instead of the default, a text file).
Advanced import commands
--fields-terminated-by '|'
changes the field delimiter from the default\t
to another one.--hive-import
can import the data directly into the Hive warehouse (instead of HDFS, the default). Use--hive-overwrite
to replace an existing table.--driver
is only relevant if the connection string does not begin withjdbc:mysql://
.--null-non-string -99
to recode NULL values from the database into -99 for non-string data types.--null-string "NA"
the same option for strings--where "product_id > 10000"
can select a subset from the table that gets imported
Export from HDFS to MySQL
To export a table, you must first create it in MySQL. Log into it as before, then create the table in your target database:
Then, a sample export command looks like this:
If you export from a Hive internal table, the export-dir is something like /user/hive/warehouse/<database_name>/<table_name>
.
Check your MySQL database if the table was imported correctly!
Advanced export commands
--input-fields-terminated-by '\0001'
is an option you set when exporting from Hive, since the default field delimiter is ASCII value 1 there.--input-null-string
andinput-null-non-string
to recode from the corresponding values to NULL in the database.