Apache Spark

Over the past couple of years, as Hadoop has become the dominant paradigm for big data processing, several facts have become clear. First, the Hadoop Distributed File System is the right storage platform for big data. Second, YARN is the resource allocation and management framework of choice for big data environments. Third, and maybe most important, there is no single processing framework that will solve every problem. Although MapReduce is an amazing technology, it doesn’t address every situation.

Businesses that rely on Hadoop need a variety of analytical infrastructures and processes to find the answers to their critical questions. They need data preparation, descriptive analysis, search, predictive analysis, and more advanced capabilities like machine learning and graph processing. Also, businesses need a tool set that meets them where they are, allowing them to leverage the skill sets and other resources they already have. Until now, a single processing framework that fits all those criteria has not been available. This is the fundamental advantage of Spark.

Though Spark is a relatively young data project, it has met all of the above requirements and more. Here are five reasons to believe that we have entered the age of Spark.

1. Spark makes advanced analytics a reality

While a majority of large and innovative companies are looking to expand their advanced analytics capability, at a recent big data analytics event in New York, only 20 percent of the participants reported that they are currently deploying advanced analytics across the enterprise. The other 80 percent said their hands are full simply preparing data and providing basic analytics. The few data scientists these companies have spend most of their time implementing and managing descriptive analytics.

Spark provides a framework for advanced analytics right out of the box. This framework includes a tool for accelerated queries (Spark SQL), a machine learning library (MLlib), a graph processing engine (GraphX), and a streaming analytics engine (Spark Streaming). Implementing these analytics in MapReduce can be nearly impossible, even for companies that manage to hire hard-to-find data scientists. Spark instead provides prebuilt libraries that are easier and faster to use. This also frees the data scientists to take on tasks beyond data preparation and quality control. With Spark, they can even ensure correct interpretation of the analysis results.

2. Spark makes everything easier

A longtime criticism of Hadoop is that it is hard to use and even harder to find people who can use it. Although Hadoop has become simpler and more powerful with every new version, this critique has persisted into the present day. Instead of requiring users to understand the various complexities, such as Java and MapReduce programming patterns, Spark is made to be accessible to anyone with an understanding of databases and some scripting skills (in Python or Scala). That makes it easier for businesses to find people who can understand their data as well as the tools to process it. And it allows vendors to develop analytics solutions faster and bring new innovation to their customers sooner.

3. Spark speaks more than one language

At this point, it may be fair to ask: If SQL didn’t already exist, would we invent SQL today to address the challenges of big data analytics? Probably not — at least not SQL alone. We would want more flexibility in getting at the answers we need, more options for organizing and retrieving data, and faster ways of moving the data into an analytics framework. Spark leaves the SQL-only mind-set behind, opening the data to the quickest and most elegant way of initiating analysis, whatever that might be for the data and business challenge at hand.
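The idea of not being limited to SQL can be illustrated with a small, Spark-free analogy. The sketch below uses only Python's standard library: `sqlite3` stands in for a SQL engine, and the `orders` table, its columns, and the sample rows are invented for illustration. It is not Spark's API; it only shows the same question answered once declaratively in SQL and once in ordinary code.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, order amount).
ORDERS = [("east", 120.0), ("west", 80.0), ("east", 30.0), ("north", 50.0)]


def totals_with_sql(rows):
    """Answer 'total amount per region' declaratively, with a SQL query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM orders GROUP BY region"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


def totals_with_code(rows):
    """Answer the same question programmatically, with a plain loop."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


if __name__ == "__main__":
    print(totals_with_sql(ORDERS))
    print(totals_with_code(ORDERS))
```

Both functions return the same per-region totals. Which form is more convenient depends on the data and the question, which is the flexibility this section argues for.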

4. Spark accelerates results

As the pace of business continues to accelerate, the need for real-time results continues to grow. Spark provides parallel in-memory processing that returns results many times faster than any other approach requiring disk access. Near-instant results eliminate delays that can significantly slow incremental analytics and the business processes that rely on them. As vendors begin to leverage Spark to build applications, dramatic improvements to the analyst workflow will follow. Accelerating the turnaround time for answers means that analysts can work iteratively, homing in on more precise, and more complete, answers. Spark lets analysts do what they are supposed to do: find better answers faster.
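Why in-memory processing helps iterative work can be sketched in plain Python. This is an analogy under stated assumptions, not Spark code: the file layout (one integer per line) and all function names are hypothetical. One version re-reads the data from disk on every analysis pass. The other loads it once and reuses the in-memory copy for each pass.

```python
import os
import tempfile


def write_sample(path, n=1000):
    """Write the integers 0..n-1 to a text file, one per line (made-up data)."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{i}\n")


def read_values(path):
    """Read every value back from disk."""
    with open(path) as f:
        return [int(line) for line in f]


def counts_rereading_disk(path, thresholds):
    """Each pass goes back to disk, so I/O cost is paid once per pass."""
    return [sum(1 for v in read_values(path) if v >= t) for t in thresholds]


def counts_from_memory(path, thresholds):
    """Load once, then run every pass against the in-memory copy."""
    cached = read_values(path)
    return [sum(1 for v in cached if v >= t) for t in thresholds]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "values.txt")
        write_sample(sample)
        print(counts_rereading_disk(sample, [0, 500, 990]))
        print(counts_from_memory(sample, [0, 500, 990]))
```

Both versions produce identical answers. The difference is that the second pays the disk read only once, and that saving grows with each additional pass an analyst runs over the same data.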

5. Spark doesn’t care which Hadoop vendor you use

All of the major Hadoop distributions now support Spark, with good reason. Spark is a vendor-neutral solution, meaning that implementation doesn’t tie the user to any one provider. Because Spark is open source, businesses are free to create a Spark-based analytics infrastructure without having to worry about whether they might change Hadoop vendors at some point down the road. If they change, they can bring their analytics with them.

