Putting the R Into Hadoop
With the amount of data being fed into Hadoop these days, it’s natural for customers to want to run statistical analysis on it. Revolution Analytics, which distributes a commercial version of the R statistical environment, says it has lowered the barrier to entry for R programmers working with Hadoop by parallelizing many of R’s algorithms.
There is strong demand right now for senior data scientists who can write parallel programs for Hadoop. That’s lucrative for somebody with the right skills, but an expensive proposition for an organization looking to parse its big data in new and profitable ways.
Bill Jacobs, director of product marketing at Revolution Analytics, has heard some horror stories about the “blood money” companies are paying for top Hadoop skills, particularly for people who also know statistics.
Jacobs says he heard from a CIO who was between a rock and a hard place. “He had to go out and pay $300,000, an ungodly amount of money, to hire a data scientist because he needed someone who could understand heavy statistics, Hadoop modeling, and all the stuff that’s easily done in R,” he says in an interview. “But he needed someone who could do it in Java on Hadoop, and that is a very, very sought after resource.”
With the launch of Revolution R Enterprise 7.0, expected later this year, Revolution Analytics will have parallelized many of the algorithms in the R library and streamlined their execution under the Hadoop distributions from Hortonworks and Cloudera.
This will open up new big data opportunities for the population of 2 million R programmers around the world, allowing them to use their skills against Hadoop-based data without needing to know Java, Python, MapReduce, or how to write parallel algorithms. (They will, of course, need the package from Revolution Analytics to enable this.)
“We bring the corporation that’s going into the Hadoop world a chance to tap a huge–and, particularly, modestly priced–talent base as opposed to a Ph.D. Stanford bioinformatics statistician. Those are $300,000 per year resources,” Jacobs says. “Java is a lovely thing if you’re a Java programmer. But you’re a statistician. You didn’t learn Java in school. You’re an R programmer.”
Revolution Analytics isn’t giving the Hadoop treatment to all 4,700 or so algorithms in the CRAN library. But a good number of them are getting it, Jacobs says, including generalized linear models, logistic regression, linear regression, stepwise linear regression, and k-means clustering.
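To illustrate what this workflow looks like for an R programmer, here is a hypothetical sketch in the style of Revolution Analytics’ RevoScaleR package. The function names (RxHadoopMR, rxSetComputeContext, rxLogit, rxKmeans) come from that package, but the exact Revolution R Enterprise 7.0 interface, and all of the data set names, columns, and host names below, are assumptions for illustration, not details from the article.

```r
# Hypothetical sketch in the RevoScaleR style; the exact Revolution R
# Enterprise 7.0 API may differ from what is shown here.
library(RevoScaleR)

# Point the compute context at a Hadoop cluster (host name is made up).
# Subsequent rx* modeling calls are then parallelized across the cluster
# instead of running on the local machine.
hadoopCC <- RxHadoopMR(nameNode = "my-namenode", port = 8020)
rxSetComputeContext(hadoopCC)

# Reference a (hypothetical) data set already stored in HDFS.
airData <- RxTextData("/data/airline.csv", fileSystem = RxHdfsFileSystem())

# Fit a logistic regression in parallel; no Java, no hand-written MapReduce.
fit <- rxLogit(Late ~ DayOfWeek + DepHour, data = airData)
summary(fit)

# k-means clustering follows the same pattern.
clusters <- rxKmeans(~ DepDelay + ArrDelay, data = airData, numClusters = 5)
```

The key idea is that the compute context is set once, and the modeling functions keep a familiar R-formula interface, which is what lets a statistician avoid Java and explicit parallel programming.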
This is not the first time that Revolution Analytics has tackled the big elephant in the room. About nine months ago it released a package that enabled the R language to run against Hadoop file systems. However, that package left a lot to be desired in the ease-of-use department: it required programmers to have strong Hadoop skills and to write their algorithms in ways that tolerated parallelism.

By Alex Woodie