Twitter open sources Storm-Hadoop hybrid called Summingbird
Twitter has open sourced a system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system. In the case of Twitter, Hadoop handles batch processing, Storm handles stream processing, and the hybrid system is called Summingbird. It’s not a tool for every job, but it sounds pretty handy for those it’s designed to address.
Twitter’s blog post announcing Summingbird is pretty technical, but the problem is pretty easy to understand if you think about how Twitter works. Services like Trending Topics and search require real-time processing of data to be useful, but they eventually need to be accurate and probably analyzed a little more thoroughly. Storm is like a hospital’s triage unit, while Hadoop is like longer-term patient care.
This description of Summingbird from the project’s wiki does a pretty good job of explaining how it works at a high level. The implementation is a little more complex, of course:
The hybrid model allows most data to be processed by Hadoop and served out of a read-only store like Manhattan. Only data that Hadoop hasn’t yet been able to process, data that falls within the latency window, would be served out of a datastore populated in realtime by Storm. The error of the realtime layer is bounded, as Hadoop will eventually get around to processing the same data and smoothing out any error introduced.
Hybrid systems like this are actually becoming more common as companies realize they can’t survive in a real-time world with Hadoop alone. We’ve covered systems at numerous companies — Gravity, LinkedIn and Netflix among them — that aim to do something similar. Summingbird might be different in that it’s a hybrid system handling data from both Hadoop and Storm, as opposed to a pipeline of different systems, but web companies need some way to ensure they’re not trading off speed for accuracy, or vice versa. By Derrick Harris Read more