Apache Hive Updated with SQL-on-Hadoop Features

The Apache Hive community has voted on and released version 0.13. This is a significant release that represents a major effort from over 70 members who worked diligently to close out over 1080 JIRA tickets.

Hive 0.13 also delivers the third and final phase of the Stinger Initiative, a broad community based initiative to drive the future of Apache Hive, delivering 100x performance improvements at petabyte scale with familiar SQL semantics. These improvements extend Hive beyond its traditional roots and brings true interactive SQL query to Hadoop.

Ultimately, over 145 developers representing 44 companies, from across the Apache Hive community contributed over 390,000 lines of code to the project in just 13 months, nearly doubling the Hive code base.

The three phases of this important project spanned Hive versions 0.11, 0.12 and 0.13. Additionally, the Apache Hive team coordinated this 0.13 release with the simultaneous release of Apache Tez 0.4. Tez’s DAG execution speeds Hive queries run on Tez.

Speed & Scale

With the delivery of Hive on Tez, users have the option of executing queries on Tez. Tez’s dataflow model on a DAG of nodes facilitates simpler, more efficient query plans, which translates to significant performance improvements and interactive query on Hive / Hadoop.

Some of the techniques that account for the speedup are:

Broadcast Joins – like MapJoin, but without need to build a hashtable on the client,
Dynamic Partitioned Hash Joins – to distribute small table based on the Big Table bucketing trait,
Cardinality estimation-based decision on Join algorithm and parallelism, and
Pre-launch of containers

Hive now has a vectorized query execution mode that performs CPU computations 5-10x faster, translating to a 2-3x improvement in query performance. Vectorized mode supports:

All common SQL operators: Project, Filter, MapJoin, SMBJoin, and GroupBy.
All common SQL functions: In, Case, Between, Comparators, String and Date.

Hive 0.13 introduces a cost-based optimizer supporting join reordering.

Hive 0.13 also includes these other Speed improvements:

Stats-based short cuts of aggregated queries (e.g. min, max and count)
Split elimination in ORC, using stripe stats
Meta store partition pruning for more datatypes
Faster plan serialization
Faster MapJoins by improving the Hashtable footprint
Order of magnitude speedup of fetching Column level Stats

SQL

With the SQL standard-based authorization feature in Hive 0.13, users can now define their authorization policies in an SQL-compliant fashion. We extended SQL language to support grant and revoke on entities. Hive also now supports show roles, user privileges, and active privileges. Version 0.13 has a revamped, pluggable authorization API, which plugs gaps in authorization checks.

Other features added in the SQL category include:

Support for the DECIMAL and CHAR datatypes
Unqualified joining conditions
Standard-based Quoted Identifier behavior
Common table expressions
Sub-query for IN, NOT IN, EXISTS and NOT EXISTS (correlated and uncorrelated)
Permanent functions
JOIN conditions in the WHERE clause

The ongoing ACID work lays the groundwork for managing dimension tables and other master data, with the guarantee of consistent and repeatable reads. It also introduces transactions and support for streaming data into Hive. Hive 0.13 gives a preview of this functionality through allowing data to be streamed into Hive using Apache Flume, making data available for query within seconds.

Additional Improvements

Hive 0.13 adds many improvements to HiveServer2, HCatalog and JDBC access:

Hive Server 2
- HTTP support
- SSL support for both binary and HTTP (HTTPS)
- Kerberos authentication over HTTP(S)
- Support for HTTP(S) through a trusted proxy
HCatalog
- HCatalog parity for all Hive data types
- Reconciliation of HCatalog and Hive “INSERT INTO” semantics
JDBC
- Support for JDBC job cancel
- Async execution