Within the Hadoop ecosystem, Impala is significant because of its:

  • Scalability
  • Flexibility
  • Efficiency

What’s Impala?

Impala is…

Interactive SQL – Impala is typically 5 to 65 times faster than Hive, cutting response times to seconds rather than minutes.

Nearly ANSI-92 standard and compatible with Hive SQL – Impala provides a compatible SQL interface for existing CDH and Hadoop applications, based on industry-standard SQL.

Native on Hadoop/HBase Storage – Impala is built natively on Hadoop, so Hadoop’s advantages of scaling, cost, and flexibility carry over to Impala. The existing Hadoop system, along with HBase storage, can be used with Impala; data and metadata do not need to be duplicated or synchronized between multiple systems. Local processing is ensured in order to avoid network bottlenecks.

Separate, purpose-built runtime distinct from MapReduce – MapReduce is designed for batch processing, whereas Impala is purpose-built for low-latency SQL queries on Hadoop.


1.      Better Value – Once Impala came into existence, BI tools became practical on Hadoop, enabling a move from tens of Hadoop users to hundreds of SQL users per cluster, with hardly any delay from data migration.

2.      Flexibility – All existing data can be queried, using whichever file formats fit best, and multiple frameworks can operate simultaneously on the same data.

3.      Cost Efficiency – By drastically reducing data movement, duplicate storage, and redundant computation, Impala can bring costs down to roughly 1% to 10% of what an analytic DBMS would require.

4.      Full-Fidelity Analysis – There is no loss of information from fixed schemas or pre-aggregations.

Impala Query Execution:

1.      The request or query reaches Impala via ODBC/JDBC, the Impala shell, or Beeswax.

2.      The planner turns the request into a collection of plan fragments.

3.      The coordinator then takes control and initiates execution. The impalad daemons that are local to the data execute the fragments.

4.      Intermediate results are streamed between these impalads throughout processing.

5.      The results of the query are finally streamed back to the client.
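The five steps above can be sketched as a small simulation. This is a hypothetical, highly simplified model, not the real Impala API: all names here (`planner`, `impalad`, `coordinator`) are illustrative stand-ins for the roles described in the steps.

```python
# Hypothetical sketch of the flow above: a planner splits a query into
# plan fragments, "impalad" workers local to each data block execute
# them, and the coordinator streams merged results back to the client.

def planner(query, num_blocks):
    # Step 2: turn the request into one plan fragment per data block.
    return [{"query": query, "block": b} for b in range(num_blocks)]

def impalad(fragment, data_blocks):
    # Step 3: each impalad scans only the block stored on its own node
    # (local processing avoids network bottlenecks).
    for row in data_blocks[fragment["block"]]:
        yield row * 2  # stand-in for per-row query work

def coordinator(query, data_blocks):
    # Steps 4-5: intermediate results are streamed, not written to disk
    # between stages as MapReduce would do.
    for frag in planner(query, len(data_blocks)):
        yield from impalad(frag, data_blocks)

data = [[1, 2], [3, 4], [5, 6]]  # three "HDFS blocks"
print(list(coordinator("SELECT ...", data)))  # [2, 4, 6, 8, 10, 12]
```

Because the coordinator is a generator, the first rows reach the client before the last fragment has finished, which is the essence of the streaming behavior in steps 4 and 5.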


Though Impala and Hive share metadata, ODBC/JDBC drivers, flexible file formats, the Hue GUI, SQL syntax, and the machine pool, they are built for different purposes.

Hive runs on top of MapReduce and is well suited to batch processing.

When it comes to Impala, the fundamental difference is that it is a native MPP (massively parallel processing) query engine, which makes it ideal for interactive SQL.
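The batch-versus-interactive contrast can be illustrated with a deliberately simplified sketch (purely hypothetical, not either engine's real code): a MapReduce-style engine materializes each stage's full output before the next stage starts, while an MPP engine pipelines rows between stages.

```python
# Simplified contrast between the two execution models.

def batch_style(rows):
    # Hive/MapReduce model: stage 1 is fully materialized before
    # stage 2 begins, adding latency between stages.
    stage1 = [r + 1 for r in rows]   # whole intermediate result built
    return [r * 10 for r in stage1]

def pipelined_style(rows):
    # Impala's MPP model: each row flows through both stages at once,
    # so the first result is available before the last row is read.
    for r in rows:
        yield (r + 1) * 10

rows = [1, 2, 3]
print(batch_style(rows))             # [20, 30, 40]
print(list(pipelined_style(rows)))   # [20, 30, 40], same answer, lower latency to first row
```

Both paths compute the same answer; the difference is when the first row becomes available, which is why pipelining suits interactive SQL while materialization suits batch jobs.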


Impala is most effective in the following cases, providing a cost-effective ad hoc query environment that can offload the data warehouse for:

–          Interactive BI/analytics on more data.

–          A query-able archive with full fidelity.

–          Data processing with tight SLAs.


At the same time, the platform stays unified, because there is:

–          only one pool of data,

–          only one metadata model,

–          one security framework, and

–          only one set of system resources.


Compared to batch MapReduce, remote query engines, or a siloed DBMS, Impala, being integrated into Hadoop, gains the benefits of being fast, flexible, and cost-effective.