Hadoop still too slow for real-time analysis applications?
With all the buzz that Hadoop is generating in IT circles these days, it’s easy to start thinking that the open source distributed processing framework can handle just about anything in big data environments. But real-time analysis involving ad hoc querying of Hadoop data has been a notable exception.
Hadoop is optimized to crunch through large sets of structured, unstructured and semi-structured data, but it was designed as a batch processing system, and that design doesn't lend itself to fast, interactive data analysis.
And Jan Gelin, vice president of technical operations at Rubicon Project, said analytics speed is something that the online advertising broker needs — badly.
Rubicon Project is based in Playa Vista, Calif., and offers a platform for advertisers to use in bidding for ad space on webpages as Internet users visit the pages. The system allows the advertisers to see information about website visitors before making bids to try to ensure that ads will be seen only by interested consumers. Gelin said the process involves a lot of analytics, and it all has to happen in fractions of a second.
Rubicon leans heavily on Hadoop to help power the ad-bidding platform. But the key, Gelin said, is to pair Hadoop with other technologies that can handle true real-time analytics. Rubicon uses the Storm complex event processing engine to capture and quickly analyze large amounts of data as part of the ad-bidding process. Storm then sends the data into a cluster running MapR Technologies Inc.’s Hadoop distribution. The Hadoop cluster is primarily used to transform the data to prepare it for more traditional analytical applications, such as business intelligence reporting. Even for that stage, though, much of the information is loaded into a Greenplum analytical database after the transformation process is completed.
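The division of labor described above, with a streaming engine answering questions as events arrive and a batch system working through the accumulated data afterward, can be illustrated with a minimal Python sketch. It is not Rubicon's code and uses no Storm, Hadoop or Greenplum APIs; the names `SlidingWindowCounter` and `batch_count`, along with the sample events, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Streaming side (hypothetical): per-key counts over a recent time window.

    Each event updates the counts immediately, so a query can be answered
    at any moment without rescanning the full history.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()           # (timestamp, key) in arrival order
        self.counts = defaultdict(int)  # key -> events still inside the window

    def add(self, timestamp, key):
        """Record one event and drop events that have aged out of the window."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def count(self, key):
        """Return the number of events for `key` currently inside the window."""
        return self.counts.get(key, 0)


def batch_count(records):
    """Batch side (hypothetical): scan every stored record once, then report."""
    totals = defaultdict(int)
    for _timestamp, key in records:
        totals[key] += 1
    return dict(totals)


# Made-up events: (seconds since start, site identifier)
events = [(0.0, "site-a"), (0.5, "site-b"), (1.2, "site-a"), (3.4, "site-a")]

counter = SlidingWindowCounter(window_seconds=2.0)
for ts, key in events:
    counter.add(ts, key)           # answer is current after every event

recent = counter.count("site-a")   # only events in the last 2 seconds
totals = batch_count(events)       # full history, computed after the fact
```

The streaming counter trades completeness for immediacy by keeping only a recent window in memory, while the batch function sees everything but only after the data has been collected. That is roughly the trade-off the article attributes to pairing a real-time engine with Hadoop.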
Hadoop realism
Gelin said the sheer volume of data that Rubicon produces on a daily basis pointed it toward Hadoop’s processing muscle. But when it comes to analyzing the data, he added, “You can’t take away the fact that Hadoop is a batch-processing system. There are other things on top of Hadoop you can play around with that are actually like real real-time.”
Several Hadoop vendors are trying to eliminate the real-time analytics restrictions. Cloudera Inc. got the ball rolling in April by releasing its Impala query engine, promising the ability to run interactive SQL queries against Hadoop data with near-real-time performance. Pivotal, a data management and analytics spinoff from EMC Corp. and its subsidiary VMware, followed three months later with a similar query engine named Hawq. Also looking to get in the game is Splunk Inc., which focuses on capturing streams of machine-generated data; it made a Hadoop data analysis tool called Hunk generally available in late October.