Apache Hive Updated with SQL-on-Hadoop Features
The Apache Hive community has voted on and released version 0.13. This is a significant release that represents a major effort from over 70 members who worked diligently to close out over 1080 JIRA tickets.
Hive 0.13 also delivers the third and final phase of the Stinger Initiative, a broad community based initiative to drive the future of Apache Hive, delivering 100x performance improvements at petabyte scale with familiar SQL semantics. These improvements extend Hive beyond its traditional roots and brings true interactive SQL query to Hadoop.
Ultimately, over 145 developers representing 44 companies, from across the Apache Hive community contributed over 390,000 lines of code to the project in just 13 months, nearly doubling the Hive code base.
The three phases of this important project spanned Hive versions 0.11, 0.12 and 0.13. Additionally, the Apache Hive team coordinated this 0.13 release with the simultaneous release of Apache Tez 0.4. Tez’s DAG execution speeds Hive queries run on Tez.
With the delivery of Hive on Tez, users have the option of executing queries on Tez. Tez’s dataflow model on a DAG of nodes facilitates simpler, more efficient query plans, which translates to significant performance improvements and interactive query on Hive / Hadoop.
Some of the techniques that account for the speedup are:
- Broadcast Joins – like MapJoin, but without need to build a hashtable on the client,
- Dynamic Partitioned Hash Joins – to distribute small table based on the Big Table bucketing trait,
- Cardinality estimation-based decision on Join algorithm and parallelism, and
- Pre-launch of containers
Hive now has a vectorized query execution mode that performs CPU computations 5-10x faster, translating to a 2-3x improvement in query performance. Vectorized mode supports:
- All common SQL operators: Project, Filter, MapJoin, SMBJoin, and GroupBy.
- All common SQL functions: In, Case, Between, Comparators, String and Date.
Hive 0.13 introduces a cost-based optimizer supporting join reordering.
Hive 0.13 also includes these other Speed improvements:
- Stats-based short cuts of aggregated queries (e.g. min, max and count)
- Split elimination in ORC, using stripe stats
- Meta store partition pruning for more datatypes
- Faster plan serialization
- Faster MapJoins by improving the Hashtable footprint
- Order of magnitude speedup of fetching Column level Stats
SQL
With the SQL standard-based authorization feature in Hive 0.13, users can now define their authorization policies in an SQL-compliant fashion. We extended SQL language to support grant and revoke on entities. Hive also now supports show roles, user privileges, and active privileges. Version 0.13 has a revamped, pluggable authorization API, which plugs gaps in authorization checks.
Other features added in the SQL category include:
- Support for the DECIMAL and CHAR datatypes
- Unqualified joining conditions
- Standard-based Quoted Identifier behavior
- Common table expressions
- Sub-query for IN, NOT IN, EXISTS and NOT EXISTS (correlated and uncorrelated)
- Permanent functions
- JOIN conditions in the WHERE clause
The ongoing ACID work lays the groundwork for managing dimension tables and other master data, with the guarantee of consistent and repeatable reads. It also introduces transactions and support for streaming data into Hive. Hive 0.13 gives a preview of this functionality through allowing data to be streamed into Hive using Apache Flume, making data available for query within seconds.
Additional Improvements
Hive 0.13 adds many improvements to HiveServer2, HCatalog and JDBC access:
- Hive Server 2
- HTTP support
- SSL support for both binary and HTTP (HTTPS)
- Kerberos authentication over HTTP(S)
- Support for HTTP(S) through a trusted proxy
- HCatalog
- HCatalog parity for all Hive data types
- Reconciliation of HCatalog and Hive “INSERT INTO” semantics
- JDBC
- Support for JDBC job cancel
- Async execution
All of these Hive improvements mean that Hive 0.13 accepts a very large percentage of TPC-DS benchmark queries without rewrites. Read more