Why to choose bucketing in Hive?

There are two reasons why we might want to organize our tables (or partitions) into buckets. The first is to enable more efficient queries. Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. In particular, a join of two tables that are bucketed on the same columns – which include the join columns – can be efficiently implemented as a map-side join.

The second reason to bucket a table is to make sampling more efficient. When working with large datasets, it is very convenient to try out queries on a fraction of your dataset while you are in the process of developing or refining them.

Let’s see how to tell Hive that a table should be bucketed. We use the CLUSTERED BY clause to specify the columns to bucket on and the number of buckets:

CREATE TABLE student (rollNo INT, name STRING) CLUSTERED BY (id) INTO 4 BUCKETS;

Hive Tips :: Joins occur Before Where Clause

Joins occur BEFORE WHERE CLAUSES. So, if we want to restrict the OUTPUT of a join, a requirement should be made in the WHERE clause, otherwise it should be in the JOIN clause. A big point of confusion for this issue is partitioned tables:

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
WHERE a.ds='2015-10-01' AND b.ds='2015-10-01'

Above query will join a on b, producing a list of a.val and b.val. The WHERE clause, however, can also reference other columns of a and b that are in the output of the join, and then filter them out. However, whenever a row from the JOIN has found a key for a and no key for b, all of the columns of b will be NULL, including the ds column. This is to say, you will filter out all rows of join output for which there was no valid b.key, and thus you have outsmarted your LEFT OUTER requirement. In other words, the LEFT OUTER part of the join is irrelevant if you reference any column of b in the WHERE clause. Instead, when OUTER JOINing, use this syntax:

SELECT a.val, b.val FROM a LEFT OUTER JOIN b
ON (a.key=b.key AND b.ds='2015-10-01' AND a.ds='2015-10-01')

Therefore, the result is that the output of the join is pre-filtered, and you won’t get post-filtering trouble for rows that have a valid a.key but no matching b.key. The same logic applies to RIGHT and FULL joins.

What is Hive?

Hive is one of the important tools in Hadoop eco-system and it provides an SQL like dialect to Hadoop Distributed File System (HDFS).

Features of Hive:-

  • Tools to enable easy data extract/transform/load (ETL)
  • A mechanism to project structure on a variety of data formats
  • Access to files stored either directly in HDFS or other data storage systems as HBase
  • Query execution through MapReduce jobs.
  • SQL like language called HiveQL that facilitates querying and managing large data sets residing in Hadoop.

Limitations of Hive:-

  • Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.
  • Hive does not provide record-level update, insert or delete.
  • Hive queries have higher latency than SQL queries, because of start-up overhead for MapReduce jobs submitted for each hive query.
  • As Hadoop is a batch-oriented system, Hive doesn’t support OLTP (Online Transaction Processing).
  • Hive is close to OLAP (Online Analytic Processing) but not ideal since there is significant latency between issuing a query and receiving a reply, both due to the overhead of Mapreduce jobs and due to the size of the data sets Hadoop was designed to serve.