A Quick Guide To Choosing The Right Way To Use Hadoop
By Dan Woods

The market for Hadoop and related products is one of the most active in all of enterprise software. I’ve developed a simple framework that quickly explains the differences in how Hadoop distributions are created and packaged, which may help you find the product that is right for you. While this framework is far from an exhaustive method for evaluating what is a big decision, most people I have shown it to have either learned something or had a new thought or two.
Here’s why such a framework matters. Given all of the buzz about big data, CEOs and boards of directors are demanding that CIOs and CTOs explain what, if anything, a company needs to do about it. In most cases, the answer to this question boils down to two things:
Starting a search inside and outside the company for relevant big data that has a chance of providing important signals of value to the business
Developing a capacity to analyze big data, extract signals, and combine it with other types of data to create value for the business
As I’ve pointed out before (see “Do You Suffer from the Data Not Invented Here Syndrome” and “How External Data Opens Up a Disruptive Frontier for Business Intelligence”), the search for data is under-emphasized. But choosing the right approach and supporting technology matters too. While there are many ways of processing big data, Hadoop has gained immense momentum and has a thriving ecosystem of open source projects and vendors. For many companies, then, choosing the right way to exploit big data amounts to choosing among the available Hadoop options, a decision that will be analyzed over and over for the next few years. Here’s my suggestion for finding the right way to put Hadoop to work.
Four Ways to Get Value from Hadoop
Here are the four ways to categorize how to get value out of Hadoop:
Use a distribution from a completer, a company that takes the Apache open source Hadoop code and adds proprietary extensions to complete it for a particular purpose.
Use a distribution from a builder, a company that focuses on building functionality that is then shared by all distributions and provides services to make the core distribution more productive.
Use an offering from an embedder, a company that takes advantage of what Hadoop can do by incorporating it into another product or environment.
Become a customizer, using your own team to adapt the Apache Hadoop code to your purposes.
What is often lost in discussions of big data and Hadoop is the fact that the point is not to choose the “best” technology in some abstract sense, but to choose the technology that best fits your needs. Once you understand your needs, usually through analysis and experimentation, this framework can help accelerate the process of finding the right fit.
Completers: Providing Enhanced Functionality and Improved Engineering
In terms of the amount of proprietary functionality created, MapR Technologies is the leader of the completers, a group that includes Cloudera, Hadapt, IBM, WANdisco, and others. To be a member of the completers, a company doesn’t have to act only as a completer, but it must act as a completer in some significant way, by adding extra functionality to Hadoop or replacing some part of it. Cloudera, for example, acts as a builder in addition to being a completer.
The completers view Hadoop as a work in progress that performs better when it is improved in specific ways to meet market needs. Instead of waiting for the builders to fix every problem, the completers take the Hadoop code as a foundation and add unique extensions to fill gaps in functionality or to improve the engineering of the implementation. The business model of the completers is primarily to charge a fee for support of an enhanced distribution, although these firms may offer training and services as well.
One of the first ways that Hadoop was extended in a proprietary fashion was through the creation of administrative, management, and monitoring products to make it easier to install, configure, and operate Hadoop. Both Cloudera and MapR offer such a proprietary framework.
I consider MapR the leader of the completers because the company has been the most aggressive both in extending Hadoop and in fixing problems with the engineering of the implementation. MapR, for example, has rewritten the HDFS implementation, eliminating the dependency on the Linux file system, to make it more efficient for large numbers of smaller files, to address some challenges with maintaining the name node, and to support the NFS protocol, which allows many other programs to access files in HDFS in a read/write manner. MapR has also created extensions for supporting security based on SSL. Cloudera, in addition to its management application, has added Cloudera Navigator, a data management application, as a proprietary extension.
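The practical effect of NFS support is that any program using ordinary POSIX file operations can read and write files that live in HDFS, with no Hadoop client libraries involved. A minimal sketch of what that looks like, assuming the cluster’s namespace is exposed at a hypothetical mount point such as /mnt/hdfs (here a local temporary directory stands in for the mount so the example is self-contained):

```python
import os
import tempfile

# Hypothetical NFS mount point for the cluster (e.g. /mnt/hdfs).
# A local temporary directory stands in so this sketch runs anywhere.
mount_point = tempfile.mkdtemp()

# With the file system exported over NFS, this is plain POSIX I/O:
# no HDFS client API, no `hdfs dfs -put` staging step.
path = os.path.join(mount_point, "events.log")
with open(path, "a") as f:
    # Appending to an existing file is exactly the kind of read/write
    # access the article says NFS support opens up to ordinary programs.
    f.write("user=42 action=click\n")

with open(path) as f:
    print(f.read(), end="")
```

The point is not the code itself but that any existing tool that reads and writes files, from shell scripts to legacy applications, can work against HDFS-resident data this way.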
The completers also extend Hadoop by creating open source software. Many of these projects, such as Impala and Cloudera Search, both created by Cloudera, and Apache Drill, to which MapR commits code and whose development it leads, are developed almost exclusively by the staff of those companies. I have not done my homework on all of these projects, but MapR, for example, believes it should be considered a builder when it comes to Apache Drill. Cloudera started out dominating Impala but has encouraged others to join the effort. Hortonworks certainly acts as a completer when it founds a new open source project to fill a need. Is an open source project controlled by one vendor a proprietary extension in disguise? Does it matter that some of these projects are run as shared GitHub projects without formal Apache governance? We will leave these questions for another day.
WANdisco represents another type of completer, one that brings a special expertise to Hadoop to enhance it in a particular way. WANdisco’s first business was an extension to the Git source code management system that created a masterless system of replication. Now WANdisco has applied that expertise to a distribution that allows Hadoop clusters to be replicated in a masterless fashion. Hadapt has a similar model: it adds the ability to build a robust SQL database on top of data stored in Hadoop.