Why to choose bucketing in Hive?

There are two reasons why we might want to organize our tables (or partitions) into buckets. The first is to enable more efficient queries. Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. In particular, a join of two tables that are bucketed on the same columns – which include the join columns – can be efficiently implemented as a map-side join.

The second reason to bucket a table is to make sampling more efficient. When working with large datasets, it is very convenient to try out queries on a fraction of your dataset while you are in the process of developing or refining them.

Let’s see how to tell Hive that a table should be bucketed. We use the CLUSTERED BY clause to specify the columns to bucket on and the number of buckets:


Hive Tips :: Joins occur Before Where Clause

Joins occur BEFORE WHERE CLAUSES. So, if we want to restrict the OUTPUT of a join, a requirement should be made in the WHERE clause, otherwise it should be in the JOIN clause. A big point of confusion for this issue is partitioned tables:

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
WHERE a.ds='2015-10-01' AND b.ds='2015-10-01'

Above query will join a on b, producing a list of a.val and b.val. The WHERE clause, however, can also reference other columns of a and b that are in the output of the join, and then filter them out. However, whenever a row from the JOIN has found a key for a and no key for b, all of the columns of b will be NULL, including the ds column. This is to say, you will filter out all rows of join output for which there was no valid b.key, and thus you have outsmarted your LEFT OUTER requirement. In other words, the LEFT OUTER part of the join is irrelevant if you reference any column of b in the WHERE clause. Instead, when OUTER JOINing, use this syntax:

ON (a.key=b.key AND b.ds='2015-10-01' AND a.ds='2015-10-01')

Therefore, the result is that the output of the join is pre-filtered, and you won’t get post-filtering trouble for rows that have a valid a.key but no matching b.key. The same logic applies to RIGHT and FULL joins.

What is Hive?

Hive is one of the important tools in Hadoop eco-system and it provides an SQL like dialect to Hadoop Distributed File System (HDFS).

Features of Hive:-

  • Tools to enable easy data extract/transform/load (ETL)
  • A mechanism to project structure on a variety of data formats
  • Access to files stored either directly in HDFS or other data storage systems as HBase
  • Query execution through MapReduce jobs.
  • SQL like language called HiveQL that facilitates querying and managing large data sets residing in Hadoop.

Limitations of Hive:-

  • Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.
  • Hive does not provide record-level update, insert or delete.
  • Hive queries have higher latency than SQL queries, because of start-up overhead for MapReduce jobs submitted for each hive query.
  • As Hadoop is a batch-oriented system, Hive doesn’t support OLTP (Online Transaction Processing).
  • Hive is close to OLAP (Online Analytic Processing) but not ideal since there is significant latency between issuing a query and receiving a reply, both due to the overhead of Mapreduce jobs and due to the size of the data sets Hadoop was designed to serve.

Introduction To Apache Sqoop

To use Hadoop for analytics requires loading data into Hadoop clusters and processing it in conjunction (combination) with data that resides on enterprise application servers and databases. Loading GBs and TBs of data into HDFS from production databases or accessing it from map reduce applications is a challenging task. While doing so, we have to consider things like data consistency, overhead of running these jobs on production systems and at the end if this process would be efficient or not. Using batch scripts to load data is an inefficient way to go with.

Typical scenario, we want to process data stored in relational Database Management Systems with Map reduce capabilities. E.g. we have say, Legacy data or Lookup Tables etc. to process. Once solution we can think about is directly reading data from RDBMS tables in your mapper and the process the same. But this would lead to the equivalent of a distributed denial of service (DDoS) attack on your RDBMS. so in practice, don’t do it.
So what could be the possible solution to process data stored in relational database using Map reduce? Answer is importing that data on HDFS!!

Sqoop is nothing but SQL to Hadoop. Sqoop allows users to import data from their relational databases into HDFS and vice versa. Sqoop is an open source tool written at Cloudera .It allows us to
• Import one table
• Import complete database
• Import selected tables
• Import selected columns from a particular table
• Filter out certain rows from certain table etc.

Sqoop uses MapReduce to fetch data from RDBMS and stores that on HDFS. While doing so, it throttles the number of mappers accessing data on RDBMS to avoid DDoS. By default it uses four mappers but this value is configurable. It’s recommended not to set higher number of mappers, as it may lead to consuming all spool space of the source database.

Sqoop internally uses JDBC interface so it should work with any JDBC compatible database

Starting Sqoop
Once we have installed sqoop, it’s very easy to use sqoop command line tool. To start sqoop, you have to type sqoop with arguments and tools to run.

$ sqoop tool-name [tool-arguments]

Here tool-name would be the operation you would like to perform, e.g. import, export etc. While tool-arguments would be the extra parameters you need to specify which would help execution of tool. Example of tool-arguments could be –connect which would specify the source database URL or –username would allow you to specify the user with which we want to connect the source database.

Sqoop has help tool with it, so list down all available tools, we can run following command

$ sqoop help 
usage: sqoop COMMAND [ARGS]
Available commands:
codegen				Generate code to interact with database records
create-hive-table	Import a table definition into Hive
eval				Evaluate a SQL statement and display the results
export				Export an HDFS directory to a database table
help				List available commands
import				Import a table from a database to HDFS
import-all-tables	Import tables from a database to HDFS
list-databases		List available databases on a server
list-tables			List available tables in a database
version				Display version information
See 'sqoop help COMMAND' for information on a specific command.

Vertical Scaling vs Horizontal Scaling

Vertical Scaling is the addition of extra resources to the servers in a network. For example, a business runs up against storage capacity limits, so they choose to add extra or larger hard drives to their existing servers. It is the old fashioned approach, and it tends to be significantly more expensive than the horizontal scaling we’re going to look at below. If a company expects to be vertically scaling their existing hardware, then that hardware has to be built with the possibility in mind. It has to have excess capacity, extra drive bays, and so on, that will end up sitting idle until the moment of expansion comes. Such hardware tends to be significantly more expensive than commodity hardware: hardware prices don’t scale linearly.

Vertical scaling is also fairly inefficient when it comes to resource reallocation. Because servers tend to be dedicated to particular tasks, it can be hard to reallocate the spare resources of a server to other, more pressing tasks. That’s why virtualization has become so popular.

Horizontal Scaling, on the other hand, involves adding more nodes to the system as it scales, rather than beefing up the existing nodes. This is by far the most popular scaling strategy in modern times. Google, Facebook, and other huge enterprises rely on using many nodes of relatively inexpensive commodity-class hardware, which they can add to and retask as the need arises. Because horizontal scaling is the growth method of choice for very large Internet enterprises, software has been developed that makes distributing file serving, databases, and virtualization across an expanding pool of nodes practical. For example, Google’s Spanner is a distributed database that is designed to scale at globally. The increasingly popular Hadoop software framework has horizontal scaling as a key design principle.