Improving the Big Data Toolkit

Open source software tends to march into the marketplace step by step, a quiet but steady strategy compared with the grand marketing events of the commercial software world. And Hadoop, the bedrock software of the fast-growing Big Data business, is on the march.

Hadoop allows for relatively inexpensive data analysis, and the next generation will make that analysis possible across many thousands of computers. Hadoop 2.0, as it is known, was released for testing last month and the “general availability” release is planned for October.

Hadoop 2.0, said Merv Adrian, an analyst at Gartner, is “an important step,” making the technology “a far more versatile data operating environment.” The new version of Hadoop, he said, can handle larger data sets faster than its predecessor and it opens the door to analyzing data in real-time streams. So far, Hadoop has been used mostly to divvy up huge sets of data for analysis, but only in batches, not streams. The new Hadoop has also been tweaked to work more easily with traditional database tools, like SQL.

Hadoop 2.0, Mr. Adrian said, was built to include “requirements for the commercial mainstream.” Historically, Hadoop’s most avid users were Internet companies like Yahoo, Facebook and Amazon.

Hadoop 2.0 has been in the works for years, with many programmers designing, refining, testing and debugging the code, in keeping with the open-source development model. And the history of Hadoop itself is a neat technology tale of sharing, failure, persistence and serendipity. Hadoop traces its origins to research papers published by Google. Hadoop’s creators, Doug Cutting and Mike Cafarella, integrated those concepts into their own code. The project was named after the toy elephant of Mr. Cutting’s son and was originally meant as a tool for Nutch, an open-source search engine.

Today, corporations in many industries are trying to find cost-cutting or sales-improving insights in sensor, Web and social media data. “Everybody has the amount of data Yahoo and Google did five years ago,” said Arun Murthy, who is overseeing the development of Hadoop 2.0 as its release manager in the Apache Software Foundation.

Mr. Murthy is also a co-founder of Hortonworks, a start-up that distributes Hadoop and provides technical support for it to companies. Hortonworks is one of a handful of Hadoop distributors, including Cloudera and MapR Technologies, each with its own business model for making money from the open-source software. Cloudera, which Mr. Cutting helped start, is considered furthest along as a business, with about $100 million in yearly revenue, analysts estimate.

By STEVE LOHR
