Apache Spark is now part of MapR’s Hadoop distribution
Hadoop vendor MapR is getting in early on the Apache Spark action, too, announcing on Thursday that it’s adding the Spark stack to its Hadoop distribution as part of a partnership with Spark startup Databricks, whose co-founder and CEO is Ion Stoica. Spark allows for faster processing and easier programming of big data workloads.
An in-memory processing framework originally developed at the University of California, Berkeley, Spark has been rising in popularity over the past year or so, but it really hit the mainstream with the launch of Databricks in September 2013. Since then, Cloudera has added Spark to its Hadoop distribution (as part of a partnership with Databricks), the Apache Spark project has reached top-level status, and numerous projects and companies originally designed with Hadoop in mind are planning to support Spark or move to it whole hog.
These include Cloudera’s Oryx project, analytics startup Platfora and even the Apache Mahout project, as well as companies participating in Databricks’ certification program for Spark.
Spark is arguably so popular right now as much because of what it is as what it isn’t: MapReduce. The traditional data-processing framework for Hadoop, MapReduce is slow (it’s a batch processor) and notoriously difficult to program. Spark is fast and flexible, which makes it better for tasks such as machine learning, graph processing and interactive queries, and it is easy to program. It’s written in Scala, but also supports programming in Java, Python and, in time, R.
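To give a rough sense of why MapReduce has a reputation for being laborious, here is a minimal sketch of the classic word-count job broken into the explicit map, shuffle and reduce phases the model requires. It is plain Python that uses neither Hadoop nor Spark, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are invented for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a MapReduce mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Group values by key; in Hadoop the framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts for each word, as a MapReduce reducer would."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    # Chain the three phases; a real job would run each phase across a cluster.
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["Spark is fast", "MapReduce is slow", "Spark is flexible"]
    print(word_count(sample))
```

In Spark, the same job is typically written as a short chain of transformations on a single dataset (in the Python API, commonly flatMap, map and reduceByKey), which is a large part of what people mean when they call it easier to program.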
Much of this support for Spark is possible because of YARN, the resource-management system that’s part of Hadoop 2.0 and lets numerous processing frameworks run simultaneously on the same cluster, all accessing the Hadoop Distributed File System for storage.