The emergence of YARN for the Hadoop 2.0 platform has opened the door to new tools and applications that promise to let more companies reap the benefits of big data in ways never before possible, with outcomes possibly never imagined.  By separating cluster resource management from data processing, YARN offers a world beyond MapReduce: less encumbered by complex programming protocols, faster, and lower in cost.

Yet while many Hadoop applications have migrated and other migrations are in progress, most of these applications still cling to the original Hadoop paradigm:  MapReduce.  That’s like putting lipstick on a pig (no pun intended). These programs basically dress up the same functionality without taking advantage of the new capabilities of YARN.  Why is YARN important?  Some background may help.

Hadoop was first developed in 2005 by Doug Cutting and Mike Cafarella with the help and blessing of Yahoo, which to this day runs the largest Hadoop cluster in the world.  Hadoop was open-sourced under the auspices of Apache, and major contributors include Hortonworks, Yahoo, Cloudera, and many others.  Throughout Hadoop’s development, until the release of Hadoop 2.0 in October 2013, MapReduce was the only computational framework.  If you wanted to crunch data under Hadoop, you wrote or generated MapReduce code.  Hadoop 2.0 changed that.

Under Hadoop 2.0, MapReduce is but one instance of a YARN application, where YARN has taken center stage as the “operating system” of Hadoop.  Because YARN allows any application to run on equal footing with MapReduce, it opened the floodgates for a new generation of software applications with these kinds of features:

More programming models. Because YARN supports any application that can divide itself into parallel tasks, applications are no longer shoehorned into the palette of “mappers,” “combiners,” and “reducers.”  This in turn supports complex data-flow applications like ETL and ELT, and iterative programs like massively parallel machine learning and modeling.

Integration of native libraries. Because YARN has robust support for any executable – not limited to MapReduce, and not even limited to Java – application vendors with a large mature code base have a clear path to Hadoop integration.

Support for large reference data. YARN automatically “localizes” and caches large reference datasets, making them available to all nodes for “data local” processing.  This supports legacy functions like address standardization, which require large reference data sets that cannot be accessed from the Hadoop Distributed File System (HDFS) by the legacy libraries.
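
For readers who have not written MapReduce code, the “mappers,” “combiners,” and “reducers” palette mentioned above can be sketched in a few lines. This is a plain-Python illustration of the programming shape only, not Hadoop’s API: the function names (mapper, combiner, reducer, run_job) and the word-count task are assumptions chosen for the example, and everything runs in a single process.

```python
from collections import Counter, defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combiner(pairs):
    """Combine phase: pre-aggregate counts locally, before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reducer(word, counts):
    """Reduce phase: sum all partial counts collected for one key."""
    return word, sum(counts)


def run_job(splits):
    """Drive the three phases over input splits (each split is a list of lines).

    Hypothetical single-process driver; a real cluster would run each
    split's map and combine steps as separate parallel tasks.
    """
    shuffled = defaultdict(list)
    for split in splits:
        pairs = [pair for line in split for pair in mapper(line)]
        for word, count in combiner(pairs):
            shuffled[word].append(count)  # the "shuffle": group values by key
    return dict(reducer(word, counts) for word, counts in shuffled.items())


if __name__ == "__main__":
    print(run_job([["big data big"], ["data big cluster"]]))
```

Even this toy job has to be expressed as key/value pairs flowing through fixed phases. A multi-step data flow or an iterative algorithm has to be broken into a chain of such jobs, which is the shoehorning described above.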

Despite these innovations, most Hadoop software developers are stuck in the Hadoop 1.0 mindset.  They’ve sacrificed a “bigger leap” to broader availability and greater usability of Hadoop 2.0’s powerful resources in exchange for early market entry. The effect for users:  Hadoop still has a tall fence around it. Most Hadoop applications still suffer from one or more of these deficiencies:

• They feel like programming tools, exposing too much Java or scripting.

• Their “in Hadoop” software is a small feature subset of their “legacy” software.

• They don’t run in Hadoop at all, instead pushing queries through Pig or Hive, and are limited by the volume of data that can be pulled from Hadoop to the “outside.”

• They generate MapReduce, which, while not a problem in theory, tends to make applications feel like “MapReduce veneers.”

Fortunately, ISVs are starting to realize that the power of Hadoop 2.0 lies in enabling applications to run inside Hadoop, without the constraints of MapReduce.  Vendors like my company, RedPoint Global, as well as Revolution Analytics, Actian, and Talend are starting to create applications that, to a greater or lesser extent, feel like more than glossy MapReduce programming veneers.

One of the most exciting developments is a new crop of “visual data-flow design” applications for Hadoop.  Tools of this kind have been around for years, even decades, in the classic world of ETL, ELT, data quality, and analytic databases.  These mature products are used continuously by thousands of non-programmers to solve data problems including marketing analytics, fraud detection, clickstream monitoring, replication, and master data management. The accessibility of these solutions to analysts and “data scientists” is critical.

MapReduce Expertise Hard to Come By

MapReduce is a software framework that developers have been using for years to write programs for Hadoop.  While the popularity of Hadoop has grown, helped along by the hype around big data, the number of MapReduce programmers hasn’t climbed as fast. The bulk of them can be found in Internet companies and flashy start-ups, and if you’re a large company you might have a shot at hiring a few of them.  But big demand and low inventory mean companies are paying a premium for MapReduce skills.