What is Hive?

Hive is one of the important tools in the Hadoop ecosystem: it provides an SQL-like dialect for querying data stored in the Hadoop Distributed File System (HDFS).

Features of Hive:

  • Tools to enable easy data extract/transform/load (ETL).
  • A mechanism to project structure onto a variety of data formats.
  • Access to files stored directly in HDFS or in other data storage systems such as HBase.
  • Query execution through MapReduce jobs.
  • An SQL-like language called HiveQL that facilitates querying and managing large data sets residing in Hadoop (see the short sketch after this list).
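As an illustration, a minimal HiveQL sketch of what projecting structure and querying look like; the table name, columns, and HDFS path are hypothetical, not taken from this article:

-- Hypothetical example: project a schema onto a tab-delimited log file already in HDFS
CREATE EXTERNAL TABLE page_views (
  view_time STRING,
  user_id   BIGINT,
  page_url  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/page_views';

-- A familiar SQL-like query; Hive compiles it into MapReduce jobs
SELECT page_url, COUNT(*) AS hits
FROM page_views
GROUP BY page_url;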

Limitations of Hive:

  • Hive is best suited for data warehouse applications, where large data sets are maintained and mined for insights, reports, etc.
  • Hive does not provide record-level update, insert, or delete.
  • Hive queries have higher latency than traditional SQL queries because of the start-up overhead of the MapReduce jobs submitted for each Hive query.
  • Because Hadoop is a batch-oriented system, Hive does not support OLTP (Online Transaction Processing).
  • Hive is closer to OLAP (Online Analytical Processing), but it is not ideal even there, since there is significant latency between issuing a query and receiving a reply, both because of the overhead of the MapReduce jobs and because of the size of the data sets Hadoop was designed to serve.

Introduction To Apache Sqoop

Using Hadoop for analytics requires loading data into Hadoop clusters and processing it in combination with data that resides on enterprise application servers and in databases. Loading gigabytes or terabytes of data into HDFS from production databases, or accessing it directly from MapReduce applications, is a challenging task. While doing so, we have to consider data consistency, the overhead these jobs place on production systems, and whether the whole process will be efficient at all. Loading data with ad-hoc batch scripts is an inefficient way to go.

A typical scenario: we want to process data stored in a relational database management system (RDBMS) with MapReduce, for example legacy data or lookup tables. One solution that comes to mind is reading the data directly from the RDBMS tables inside your mappers and processing it there. But that would amount to a distributed denial-of-service (DDoS) attack on your RDBMS, so in practice, don't do it.
So what is a workable way to process data stored in a relational database with MapReduce? The answer is to import that data into HDFS first.

Sqoop is, in short, SQL-to-Hadoop. It allows users to import data from their relational databases into HDFS and vice versa. Sqoop is an open-source tool originally written at Cloudera. It allows us to:
  • Import a single table
  • Import a complete database
  • Import selected tables
  • Import selected columns from a particular table
  • Filter out certain rows from a table, etc. (see the command sketches below)
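
A few hedged command sketches of these operations; the connection URL, credentials, table, and column names are placeholders, not values from this article:

# import a single table
$ sqoop import --connect jdbc:mysql://dbserver/sales --username analyst -P --table customers

# import every table in the database
$ sqoop import-all-tables --connect jdbc:mysql://dbserver/sales --username analyst -P

# import only selected columns from a table
$ sqoop import --connect jdbc:mysql://dbserver/sales --username analyst -P \
    --table customers --columns "id,name,city"

# import only the rows that match a filter
$ sqoop import --connect jdbc:mysql://dbserver/sales --username analyst -P \
    --table customers --where "signup_date >= '2015-01-01'"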

Sqoop uses MapReduce to fetch data from the RDBMS and store it in HDFS. While doing so, it throttles the number of mappers accessing the RDBMS to avoid the DDoS scenario described above. By default it uses four mappers, but this value is configurable. It is recommended not to set a higher number of mappers, as that may consume all of the spool space of the source database.
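
For example, a hedged sketch of lowering the parallelism for a sensitive source database (placeholder connection details again):

# limit the job to two concurrent mappers against the source database
$ sqoop import --connect jdbc:mysql://dbserver/sales --username analyst -P \
    --table orders --num-mappers 2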

Sqoop internally uses the JDBC interface, so it should work with any JDBC-compatible database.
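
In practice that mostly means changing the JDBC connection string and, if the vendor's connector is not auto-detected, naming the driver class explicitly. A sketch with hypothetical PostgreSQL details:

# the PostgreSQL JDBC driver jar must be on Sqoop's classpath ($SQOOP_HOME/lib)
$ sqoop list-tables --connect jdbc:postgresql://dbserver/inventory \
    --driver org.postgresql.Driver --username analyst -P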

Starting Sqoop
Once Sqoop is installed, the command-line tool is easy to use. To start Sqoop, type sqoop followed by the tool you want to run and its arguments:

$ sqoop tool-name [tool-arguments]

Here tool-name is the operation you would like to perform, e.g. import, export, etc., while tool-arguments are the extra parameters that control how the tool executes. Examples of tool-arguments are --connect, which specifies the source database URL, and --username, which specifies the user to connect to the source database as.
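
As a concrete filling-in of that template, a hedged sketch of an export run (the directory, table, and credentials are placeholders):

# push an HDFS directory back into a relational table
$ sqoop export --connect jdbc:mysql://dbserver/sales --username analyst -P \
    --table daily_summary --export-dir /user/hadoop/daily_summary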

Sqoop ships with a help tool; to list all the available tools, run the following command:

$ sqoop help
usage: sqoop COMMAND [ARGS]
Available commands:
  codegen              Generate code to interact with database records
  create-hive-table    Import a table definition into Hive
  eval                 Evaluate a SQL statement and display the results
  export               Export an HDFS directory to a database table
  help                 List available commands
  import               Import a table from a database to HDFS
  import-all-tables    Import tables from a database to HDFS
  list-databases       List available databases on a server
  list-tables          List available tables in a database
  version              Display version information
See 'sqoop help COMMAND' for information on a specific command.

How to Import/Export PuTTY configurations in Windows

Scenario-1

Run the command below on the source system to back up (export) the existing session list:

regedit /e "%userprofile%\desktop\putty-sessions.reg" HKEY_CURRENT_USER\Software\SimonTatham\PuTTY\Sessions

Copy the putty-sessions.reg file to the target system, then right-click the file and choose Merge to import the sessions.
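
If you prefer the command line on the target system, regedit can import the file silently as well; this sketch assumes the file was copied to that system's desktop:

regedit /s "%userprofile%\desktop\putty-sessions.reg"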


Scenario-2

Exporting Registry Files
To export all or part of the registry, use Registry Editor. Once you have opened Registry Editor, you can export the registry to a file and then work with it in a text editor such as Notepad. Registry files are saved with the .reg extension, and text files with the .txt extension.

(1) Open Registry Editor. If you want to save only a particular branch, select it.
(2) On the File menu, click Export….
(3) In File name, enter a name for the registry file.
(4) In Save as type, select the file type you wish to use for the saved file.
(5) In Export Range, use the radio buttons to select whether you want to export the entire registry or only the selected branch.
(6) Click Save.
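
For reference, an exported registry (.reg) file is plain text; a PuTTY session entry looks roughly like the sketch below (the session name and values are hypothetical, and the exact keys vary with each session's settings):

Windows Registry Editor Version 5.00

; one exported PuTTY session named "my-server"
[HKEY_CURRENT_USER\Software\SimonTatham\PuTTY\Sessions\my-server]
"HostName"="server.example.com"
"PortNumber"=dword:00000016
"Protocol"="ssh"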

Importing Registry Files
The Import… command in Registry Editor can import registry files of all types, including text files.
(1) Open Registry Editor.
(2) On the File menu, click Import….
(3) Find the file you want to import, click the file to select it, and then click Open.