Stinger and Tez: A primer
What is Stinger?
The Stinger initiative aims to redesign Hive to make it what people want today: Hive is currently used for large batch jobs and works great in that sense; but people also want interactive queries, and Hive is too slow today for this. So a big drive is performance, aiming to be 100x faster.
Stinger also covers work that is in HDFS and YARN. An example of this is the new data format, ORC. Thus Stinger is not only focused on Hive, it is a cross-project initiative.
This is work done as open source, by Hortonworks, but also SAP, Microsoft, and Facebook.
Why build upon Hive rather than build a new system?
Hive is out there and being used, and works at scale. It s hard to build a new tool that works at scale; it is easy to build a tool that is fast, but works on a small dataset (Gigabyte range).
Hive already manages at scale, and scale is hard. And now the effort is focused on making it faster. It is easier to do this rather than build in scale afterwards..
Also from a vision perspective, users only need one SQL tool, not a fragmented toolkit. Otherwise different tools are going to use different flavors of SQL, different data types, which is going to create an impedance mismatch and going to cause frustration for users. So the focus is to build just one tool, that does both batch and interactive, and have the optimizer make the right decision: is it a query that’s going to run for an hour so I should use batch type operators, or is it going to take 5 sec so I should run it in an interactive way?
Why is SQL compatibility so important?
SQL is the English of the data world. You can get by pretty much anywhere with English when you are traveling. It is the same way with SQL, it is the language that everyone knows, and there is a workforce of analysts and set of BI tools that speak SQL. It just makes sense that Hadoop can speak SQL.
However Hadoop can not just speak SQL; Hadoop is so much more than that. YARN brings different data processing models: a real time streaming-type models like Storm, a graph processing model like Gyraph, an ETL model like Pig.
Stinger is not only focusing on SQL. Tez and Stinger open up to other tools like Pig, Cascading and make sure everyone can share the benefit.
SQL is how most of the world is going to communicate with Hadoop.
In Hadoop 1.0, there was only one option for how to execute their data processing: Map Reduce.
Map Reduce is a great paradigm for a whole set of problems, but in a context of a relational system that is going to stream together a set of relational operators, like Pig, Hive or Cascading, Map Reduce is not the design that you would choose if you starting out from a clean slate. That’s when Hadoop 2.0 comes in with YARN, which separates out resource management from execution. The execution moves from the system to user space, where the user can write an application manager that controls how the job is executed, so that is the opportunity for Pig and Hive to write an app manager tailored to them: Tez. Tez is application that runs on YARN, that builds the underlying execution for a parallel data processing engine for running relational operators like Group by, Join, sort. Tez is being evolved from Map Reduce, with strengths like recoverability that are baked in, and still scales and can handle faults. The underplaying execution layer can handle multiple inputs, which is needed for joins. Map Reduce 1.0 only accepts one input, and additional code is needed to shoehorn joins into it. In Tez, joins are supported natively in the execution layer.
Also it frees you up in how you move data; in MR you write data to disk in intermediary stages for persistence in HDFS; this is great for large batch jobs, and you don’t want to give that up, but for quick interactive jobs, you want to be able to move data into memory or via sockets so you can stream in data very quickly. The goal behind Tez is to enable both of those, where the optimizer of the application (like Hive, Pig) can makes the choice: should I use socket-based streaming because I know this is going to be really fast, or should I use HDFS-based communication because this job is going to take an hour, and if it blows at the 59th minute I don’t want to loose everything?
Also, Tez is about driving down the job startup times: MR was built where jobs would take an hour on average; so if it took 30 sec to start the job it was just noise. This is the way it was designed. But for interactive jobs, it is important to drive down the startup times from seconds into milliseconds.
Chip architecture optimizations
Also, Tez will be rewritten to add internal optimizers in Hive to take advantage of chip architectures; traditional databases have operated on 1 record at a time. But you can now process multiple records at a time.
The next Stinger phase will be about buffer caching for Hadoop: Hadoop needs to know what it needs to keep in memory vs. what should be on disk. Hadoop was build around the notion that you scan everything on disk every time , and that is a great fit for batches , but some jobs and data sets you need to hold the data in memory in the cache , if for example you are doing some frequent joins with say dimension tables. Web buffer caching strategies will help in that respect.
Hive needs a true cost-based optimizer for Hive, instead of the current rules-based optimizer.
For example it almost always make sense to execute a filter before a join statement. But today Hive is complex enough that there are options in how it executes things, and you don’t know what the right answer is, so you will need to make estimates on what is the optimal plan to execute a particular statement.
What Tez means for Pig and other tools?
Building Tez not just in Hive; Pig: their API’s are fairly different; different tools expose different API’s that are more appropriate for different use cases. Underneath, there are only so many things that you do to data: you can either sort it, join it, group it, filter it, or project it. Having separate implementations for this is wasted effort. Tez optimizes the execution of these operators to make sure it is shareable across the stack to different tools. Source