Using Hadoop for analytics requires loading data into Hadoop clusters and processing it in conjunction with data that resides on enterprise application servers and databases. Loading gigabytes and terabytes of data into HDFS from production databases, or accessing it from MapReduce applications, is a challenging task. While doing so, we have to consider data consistency, the overhead these jobs impose on production systems, and whether the whole process will be efficient at all. Using batch scripts to load data is an inefficient way to go.
A typical scenario: we want to process data stored in a relational database management system (RDBMS) with MapReduce, say legacy data or lookup tables. One solution we might think of is reading the data directly from the RDBMS tables in the mapper and processing it there. But this would amount to a distributed denial of service (DDoS) attack on the RDBMS, so in practice, don't do it.
So what is the right way to process data stored in a relational database with MapReduce? The answer is to import that data into HDFS first!
Sqoop is short for SQL to Hadoop. It is an open source tool, originally written at Cloudera, that allows users to import data from their relational databases into HDFS and vice versa. With Sqoop we can:
• Import one table
• Import complete database
• Import selected tables
• Import selected columns from a particular table
• Filter out certain rows from a table, and so on
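Each of the operations above maps to a variant of the sqoop import command. The following is a sketch, assuming a MySQL database named corp on a host db.example.com with a hypothetical table EMPLOYEES; all connection details and names are placeholders:

```shell
# Import a single table into HDFS
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --username dbuser -P \
  --table EMPLOYEES

# Import every table in the database
sqoop import-all-tables \
  --connect jdbc:mysql://db.example.com/corp \
  --username dbuser -P

# Import only selected columns from a table
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --username dbuser -P \
  --table EMPLOYEES \
  --columns "id,name,dept"

# Filter rows with a WHERE clause
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --username dbuser -P \
  --table EMPLOYEES \
  --where "start_date > '2010-01-01'"
```

The -P flag makes Sqoop prompt for the password interactively rather than exposing it on the command line.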
Sqoop uses MapReduce to fetch data from the RDBMS and store it on HDFS. While doing so, it throttles the number of mappers accessing the RDBMS to avoid the DDoS scenario described above. By default it uses four mappers, but this value is configurable. It is recommended not to set the number of mappers too high, as that may consume all the spool space of the source database.
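The mapper count is controlled with the -m (or --num-mappers) option. A sketch, with placeholder connection details:

```shell
# Run the import with 8 parallel mappers instead of the default 4;
# each mapper opens its own connection to the source database,
# so higher values put more load on the RDBMS
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --username dbuser -P \
  --table EMPLOYEES \
  -m 8
```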
Sqoop internally uses the JDBC interface, so it should work with any JDBC-compatible database.
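For databases Sqoop has no built-in connector for, the JDBC driver class can be supplied explicitly with --driver. A sketch assuming a PostgreSQL source; the URL and credentials are placeholders:

```shell
# The driver JAR must already be on Sqoop's classpath
# (typically dropped into $SQOOP_HOME/lib)
sqoop import \
  --connect jdbc:postgresql://db.example.com/corp \
  --driver org.postgresql.Driver \
  --username dbuser -P \
  --table EMPLOYEES
```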
Once Sqoop is installed, its command line tool is straightforward to use. You invoke sqoop with the tool to run and its arguments:
$ sqoop tool-name [tool-arguments]
Here tool-name is the operation you want to perform, e.g. import, export, etc., while tool-arguments are the extra parameters that control how the tool executes. Examples of tool-arguments are --connect, which specifies the source database URL, and --username, which specifies the user to connect to the source database as.
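Putting tool name and tool arguments together, a complete invocation might look like the following; all connection details, table, and path names are placeholders:

```shell
# Import table EMPLOYEES into a chosen HDFS directory;
# -P prompts for the password instead of putting it on the command line
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --username dbuser -P \
  --table EMPLOYEES \
  --target-dir /user/dbuser/employees
```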
Sqoop ships with a help tool. To list all available tools, we can run the following command:
$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.