The Big Data world is on a much firmer footing, as Apache Foundation’s Hadoop version 2 makes the platform more usable for business, and vendors are exploring possibilities to deliver easier-to-use software, according to the founder of the Hadoop movement.

Version 2 of the open source project includes Yarn, a job scheduler which allows different kinds of tasks to be managed on a Hadoop system. Future work on the standard is also going to be more organised, with less risk of individual projects splitting off from the mainstream, according to Doug Cutting, of Cloudera – and his company has a change of licensing structure on the way, too.
Hadoop 2 – a social milestone


“Socially, Hadoop 2 is a milestone. We are now all working on the same thing,” Cutting told TechWeekEurope at O’Reilly’s Strata event in London – a major Big Data fest. “We have methodologies where we can collaborate effectively.”

There is a plethora of open source projects around Hadoop, handling different data stores and different ways to access and search data. Those will continue – and Cutting sees the variety as a major strength of the Hadoop community, but he said there is now a “much more unified horizon” for the technical work: “People were doing things piecemeal, now we are all clicking along together.”

New items and branches can be created and developed before being passed into the mainstream of Hadoop, and there is now better agreement on this procedure, he told us: “There was a lot of suspicion about groups of developers feeling others might break things or saddle them with something that is not ready.” Now, this is done in a well-organised manner.

“We are executing as a set of competitors and moving it forward,” he said.

Apache prefers projects that are diverse, with developers from multiple vendors, he says, because there is less risk of things being decided offline, inside the company that is effectively running the project.

That doesn’t stop some single-vendor projects succeeding: for instance, Cloudera’s Impala real-time query system is open source and has been used by rivals such as MapR, despite being effectively under Cloudera’s control. Projects like this can become more widely owned in future if rivals decide to get involved in building them, Cutting said.

It’s more worrying when things in the Hadoop space fork or spin off outside the Apache Foundation’s oversight. There are some things which might never go into Apache, and some are even proprietary code, but this is “not inherently a bad thing,” said Cutting. It does create risk, but it also allows vendors to create things that are beyond what is available in the central Hadoop projects.