Chapter 4: Sqoop Import
We have come to one of the best use of Sqoop that is Sqoop Import. Sqoop import tool is used to import the tables from a relational database like MySQL, Oracle SQL etc.to the Hadoop system like Sqoop import to HDFS or Hbase etc.
Each row of an RDBMS table is treated as the records of the file. It will store text data in text file and binary data in sequence file or Avro file.
Let me explain this Sqoop Import function using a simple example here.
For example, I have one table as below-
The table name is “emp” in the database “userdb” in MySQL database server
Now we will import this table emp data into HDFS. Let’s see how easy it is-
Sqoop ‘Import’ tool is used to import the table. Sqoop provides some command to get this tool done.
$ sqoop import \ --connect jdbc:mysql://localhost/userdb \ --username root \ --table emp --m 1
Once you will run the above commands, you will find it running and using some mappers. Once done, you can also see the output using below commands-
$HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*
You will find the output file in HDFS something like below. It will be a comma (,) separated.
Using the above command, the file will be saved in a default location in HDFS.
Import Table data in a particular location
But what if we want to save the file in a particular location (let’s say rdbmsdata) in HDFS?
You can specify the location (target directory) in above command. For this, we use the below command-
So the complete command will be something like below-
$ sqoop import \ --connect jdbc:mysql://localhost/userdb \ --username root \ --table emp \ --m 1 \ --target-dir / rdbmsdata
Verify the result using the below command-
$HADOOP_HOME/bin/hadoop fs -cat /rdbmsdata/part-m-*
Import data based on a condition
Sqoop also provides the feasibility to import some part of the table. You can use “where” condition to insert some condition in the table.
For example, we just need those employee details who is a manager then simply put the condition in the above query and get that done.
The overall query will be like below-
$ sqoop import \ --connect jdbc:mysql://localhost/userdb \ --username root \ --table emp \ --m 1 \ --where “Designation =’Manager’” \ --target-dir /folderlocation
You can verify the file using the below query-
$HADOOP_HOME/bin/hadoop fs -cat /folderlocation/part-m-*
This section will be highly helpful to you if you are using on real time data. For example, you are working on some employee data of a company and everyday new hiring is getting done and so new employee will get added as well.
So in such situation, you will only want to import only the newly updated records and in such case incremental import will help you a lot.
To get this done, you need mainly three things-
- Incremental- Mode like appended
- check-column – based on which column you want to decide the incremental value
- last value – What was the last value you imported so that it will take next value after that
Now considering a new employee has been added in the emp table with the record as below-
104, Henning, 4000, IT, Developer
And we want to take ID column as the check column and so the last value was 103 and the new is 104.
The complete query to import the newly added rows in the table emp.
Let’s see the query-
sqoop import \ --connect jdbc:mysql://localhost/userdb \ --username root \ --table emp \ --m 1 \ --incremental append \ --check-column id \ --last-value 103
Again, you can check the output using the below command-
$HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*
As you have seen so far, the result will be comma separated by default while importing the data from the relational database to Hadoop.
In all the above examples, I have shown you how to import table in HDFS through Sqoop. Now let me tell you, how to import data into HBase with Sqoop.
How to Import Table Data in HBase with Sqoop?
Like we imported table in HDFS through Sqoop, in a similar way we can do for Hbase as well but there is some twist.
Sqoop doesn’t now permit you to import a relational table directly into an HBase table having multiple column families. To work around this limitation, you can create the HBase table first and then execute three Sqoop import operations to finish the task.
Let’s start by creating an Hbase table first-
Create ‘tablename’, ‘columnfamily’
For example: create ‘empdetails’, ‘personal data’, ‘professional data’
Once you have created a table in HBase, you are all set to write the command to import the RDBMS table into Hbase through Sqoop. Below is the code-
sqoop import \ --connect jdbc:mysql://localhost/serviceorderdb \ --username root -P \ --table emp \ --columns "designation,name" \ --hbase-table empdetails \ --column-family personal details \ --hbase-row-key ID -m 1
Once executed, you can scan the HBase table to check the output.
Sqoop Import All Tables
There will be a time when you will need to import all the tables of a database to Hadoop and in that case, Import_all command of Sqoop will help you out.
Let’s say I have a database in MySQL called “emp_warehouse” which has around 1k tables for different purposes and I want to import all the tables in Hadoop then I will simply write the code like below and done-
$ sqoop import-all-tables \ --connect jdbc:mysql://localhost/emp_warehouse \ --username root
Here the tool will create the directory as per the table name in HDFS.
One more thing here you should note that, if you want to import the complete database from any relational databases then you should have at least one primary key in all your table.