Hadoop gets native R programming for big data analysis
Sensing a growing interest in big data-style analysis, software provider Revolution Analytics has updated its flagship package of R statistical functions so it can be run with the Hadoop data processing platform.
Revolution R Enterprise 7 (RRE 7), to be made available on Monday, can also run R within Teradata databases.
The R language provides a way to run common statistical tests—such as linear and nonlinear modelling, time-series analysis, classification, and clustering—on a set of data, often portraying the results in graphical form.
R is becoming increasingly popular for sophisticated data analysis that goes beyond what can be offered by more standard business intelligence (BI) packages. Revolution Analytics has estimated that over 2 million people use R worldwide.
RRE 7 includes a library of R algorithms that can be run in parallel across multiple nodes, which is how Hadoop manages large data sets. RRE 7 can be added to the Cloudera CDH3 and CDH4 Hadoop distributions as well as Hortonworks Data Platform 1.3.
The new R library includes the most commonly used statistical and predictive analytics algorithms for tasks such as data processing, data sampling, descriptive statistics, statistical tests, data visualization, simulation, machine learning and predictive models.
By analyzing the data within the node in which it resides, rather than moving it somewhere else to be analyzed, R-based data analysis can be done more quickly, according to Revolution Analytics. It also allows an entire set of data to be analyzed, rather than a subset or summary of the data, which is the approach typically taken with enterprise data warehouses (EDWs).
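The idea of analyzing data where it resides can be illustrated with a small sketch. It is written in Python for illustration only and is not Revolution Analytics' implementation: the names `Partial`, `summarize_chunk` and `combine` are hypothetical, and each inner list stands in for the data held on one node. Each "node" reduces its own data to three numbers, and only those small summaries are combined to get the mean and sample variance of the full data set.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Partial:
    """Summary computed on one node (hypothetical structure)."""
    n: int           # number of values seen
    total: float     # sum of values
    total_sq: float  # sum of squared values


def summarize_chunk(values: Iterable[float]) -> Partial:
    """Reduce one node's local data to a small summary."""
    n, total, total_sq = 0, 0.0, 0.0
    for v in values:
        n += 1
        total += v
        total_sq += v * v
    return Partial(n, total, total_sq)


def combine(partials: List[Partial]) -> Tuple[float, float]:
    """Merge per-node summaries into the overall mean and sample variance."""
    n = sum(p.n for p in partials)
    total = sum(p.total for p in partials)
    total_sq = sum(p.total_sq for p in partials)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


# Toy data: three "nodes", each holding part of the full data set.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, variance = combine([summarize_chunk(c) for c in chunks])
```

Only the three-number summaries cross node boundaries, yet every value contributes to the result, so no sampling is needed. The sum-of-squares formula is used here for brevity; numerically sturdier merge formulas exist for production use.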
Revolution Analytics hopes the incorporation of R within Hadoop and the Teradata databases will also broaden the use of the language to line-of-business managers. The company has designed a new workflow interface that does not require knowledge of how to implement specific R algorithms. This eliminates the hassle of coding R with Java, or some other language, in order to have it run on the Hadoop platform.
In addition to supporting these new platforms, RRE 7 also features a number of new algorithms and processes. One is a collection of models for setting up Decision Forests, a machine learning technique for predicting future outcomes. A new batch of Stepwise Regression functionalities can help automate the process of selecting the most important variables to be used in a predictive model. A new Decision Tree visualization can provide a graphical way of depicting complex relationships and correlations within a set of data.
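To give a sense of what stepwise variable selection involves, here is a minimal forward-selection sketch in Python. It is an illustration under stated assumptions, not the RRE 7 functions: the helper names (`solve`, `rss`, `forward_stepwise`), the `min_gain` threshold and the toy data are all hypothetical, and the stopping rule (relative drop in residual sum of squares) stands in for the AIC or significance tests that statistical packages typically use.

```python
from typing import Dict, List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss(columns: List[Sequence[float]], y: Sequence[float]) -> float:
    """Residual sum of squares of a least-squares fit of y on an intercept plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(r[a] * r[b] for r in design) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum(
        (yi - sum(bj * xj for bj, xj in zip(beta, r))) ** 2
        for r, yi in zip(design, y)
    )


def forward_stepwise(features: Dict[str, Sequence[float]],
                     y: Sequence[float],
                     min_gain: float = 0.05) -> List[str]:
    """Greedily add the variable that most reduces RSS until the gain is small."""
    selected: List[str] = []
    remaining = set(features)
    current = rss([], y)
    while remaining:
        best_name, best_rss = None, current
        for name in sorted(remaining):
            cand = rss([features[s] for s in selected] + [features[name]], y)
            if cand < best_rss:
                best_name, best_rss = name, cand
        # Stop when no candidate helps, or the relative improvement is too small.
        if best_name is None or current - best_rss < min_gain * current:
            break
        selected.append(best_name)
        remaining.remove(best_name)
        current = best_rss
    return selected


# Toy data: y depends on "spend"; "noise" carries no information about y.
features = {"spend": [-1.0, -1.0, 1.0, 1.0], "noise": [-1.0, 1.0, -1.0, 1.0]}
y = [-2.5, -3.5, 2.5, 3.5]
selected = forward_stepwise(features, y)
```

On this toy data the procedure keeps `spend` and drops `noise`, because adding `noise` does not reduce the residual error. The sketch assumes the candidate variables are not perfectly collinear; a real implementation would guard against that.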