Big Data 2.0: The Next Generation of Big Data
In the last few years, Big Data has generated a lot of buzz, and several successful big data products have launched. The big data ecosystem has now reached a tipping point: the basic infrastructure needed to tackle big data challenges and opportunities is readily available. We are now entering what I would call the next generation of big data, big data 2.0, where the focus is on three key areas:
1. Speed
Data is growing at an exponential rate, and the ability to analyze it faster is more important than ever. Almost every big data vendor is coming out with offerings, such as in-memory processing, that are designed to process data faster. Hadoop's new release, Hadoop 2.0 with YARN, lets processing engines other than batch MapReduce run on the same cluster, which opens the door to near-real-time processing. Another big data technology gaining traction is Apache Spark, which is reported to run up to 100 times faster than Hadoop MapReduce on in-memory workloads. Leading Silicon Valley venture capital firm Andreessen Horowitz led a $14 million investment in Databricks, a company founded around Apache Spark.
Analytics providers are also realizing the importance of speed and have built products that can analyze terabytes of data within seconds. This aligns well with the growing presence of sensors and the Internet of Things in the consumer and industrial world. Sensors can generate millions of events per second, and analyzing them in real time is not trivial. One of our customers faced this challenge recently when their sensor data ballooned to 5 TB a day, and they quickly realized the importance of speed when handling such large data volumes.
Data storage costs have come down over the years, but storage is still expensive. Most businesses prefer to analyze streaming data in real time and filter out the noise rather than spend money storing the complete data stream.
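To put a volume like 5 TB a day in perspective, here is a quick back-of-the-envelope calculation of the average rate a system must sustain just to keep up. It assumes decimal units (1 TB = 10^12 bytes) and a perfectly even arrival rate, neither of which comes from the customer's actual setup; real sensor traffic is usually bursty, so peak rates would be higher.

```python
# Assumptions (not from the article): decimal units and a uniform arrival rate.
BYTES_PER_TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mb_per_s(tb_per_day: float) -> float:
    """Average ingest rate in MB/s needed to keep up with a daily volume."""
    return tb_per_day * BYTES_PER_TB / SECONDS_PER_DAY / 10**6


rate = sustained_rate_mb_per_s(5)
print(f"{rate:.1f} MB/s")  # prints "57.9 MB/s"
```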
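As a minimal sketch of what filtering a stream before storage can look like, the example below keeps a sensor reading only when it differs enough from the last reading that was kept. The function name `filter_noise`, the `min_change` threshold, and the sample temperatures are all hypothetical, chosen purely for illustration; a production system would more likely do this inside a stream processing engine.

```python
from typing import Iterable, Iterator, Optional


def filter_noise(readings: Iterable[float], min_change: float) -> Iterator[float]:
    """Yield only readings that differ from the last kept reading by at least min_change."""
    last_kept: Optional[float] = None
    for value in readings:
        if last_kept is None or abs(value - last_kept) >= min_change:
            last_kept = value
            yield value  # worth storing; everything else is treated as noise


# Hypothetical temperature stream: small jitter is dropped, real changes are kept.
stream = [20.0, 20.1, 20.05, 21.5, 21.6, 19.0]
kept = list(filter_noise(stream, min_change=1.0))
print(kept)  # prints [20.0, 21.5, 19.0]
```

Because the filter is a generator, it processes one event at a time and never needs to hold the full stream in memory, which is the point of filtering before storage.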
2. Data Quality
Data quality has never been sexy, but it becomes more important as data grows at an exponential rate. The speed at which decisions are made has already reached a point where the human brain can't keep up. This means that data is cleansed and processed according to defined rules, and decisions are made, all without any human intervention. In such environments, a single stream of bad data can act like a virus and result in incorrect decisions or heavy financial loss. A good example is the world of algorithmic trading, where trades are placed every few milliseconds by algorithms that analyze stock market trends, rather than by a human.
Data quality has become a key part of service level agreements (SLAs) in evolving digital enterprises. Poor-quality data can get a data provider or supplier blacklisted or lead to severe financial penalties. B2B environments are the early adopters, as they rely heavily on the quality of data to ensure smooth business operations. Some enterprises are moving toward real-time alerts for data quality issues. The alerts can be routed to a designated person based on the type of issue and can also include recommendations on how to fix it.
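To make the idea of rule-based cleansing without human intervention concrete, here is a minimal sketch of a validation gate that sits in front of an automated decision step. The `Tick` record, the three rules, and the sample data are hypothetical and far simpler than anything a real trading system would use; the point is only that records failing a rule are quarantined instead of reaching the decision logic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tick:
    symbol: str
    price: float
    volume: int


# Each rule is a (name, predicate) pair; a record must pass every predicate.
# These rules are illustrative assumptions, not an industry standard.
RULES: list[tuple[str, Callable[[Tick], bool]]] = [
    ("symbol_present", lambda t: bool(t.symbol)),
    ("price_positive", lambda t: t.price > 0),
    ("volume_non_negative", lambda t: t.volume >= 0),
]


def failed_rules(tick: Tick) -> list[str]:
    """Return the names of all rules the record violates."""
    return [name for name, check in RULES if not check(tick)]


def split_stream(ticks):
    """Separate clean records from quarantined ones, keeping the failure reasons."""
    clean, quarantined = [], []
    for tick in ticks:
        failures = failed_rules(tick)
        if failures:
            quarantined.append((tick, failures))
        else:
            clean.append(tick)
    return clean, quarantined


ticks = [Tick("ACME", 10.5, 100), Tick("ACME", -1.0, 50), Tick("", 9.9, 10)]
clean, quarantined = split_stream(ticks)
print(len(clean), [reasons for _, reasons in quarantined])
```

Only the records in `clean` would be passed on to the decision step; the quarantined ones carry the names of the rules they broke, which makes later diagnosis easier.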
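The alerting idea can be sketched as a simple routing table that maps each type of data quality issue to an owner and a suggested fix. Everything here is hypothetical: the issue names, the example.com addresses, and the recommendation texts are placeholders, and a real deployment would hand the alert to an email, chat, or paging system rather than just building an object.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    recipient: str
    issue: str
    recommendation: str


# Hypothetical routing table: issue type -> (owner, suggested fix).
ROUTING = {
    "missing_field": (
        "data-steward@example.com",
        "Ask the supplier to resend the record with all required fields.",
    ),
    "out_of_range": (
        "ops-oncall@example.com",
        "Check the source sensor or feed for a calibration or unit error.",
    ),
}
# Fallback for issue types nobody has claimed yet.
DEFAULT_ROUTE = ("data-quality@example.com", "Review the record manually.")


def build_alert(issue: str) -> Alert:
    """Pick the designated recipient and recommendation for an issue type."""
    recipient, recommendation = ROUTING.get(issue, DEFAULT_ROUTE)
    return Alert(recipient, issue, recommendation)


alert = build_alert("out_of_range")
print(alert.recipient, "-", alert.recommendation)
```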
Machine learning is another technique that is being used to improve data quality. It has made it easier to conduct pattern analysis to identify new data quality issues. Machine learning systems can be deployed in a closed-loop environment where the data quality rules are refined as new quality issues are identified via pattern analysis and other techniques.
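A closed loop of this kind can be illustrated with a deliberately simple stand-in for a learned model: a range rule whose bounds are recomputed from the data it has accepted so far. This is a statistical sketch under my own assumptions (a mean plus or minus k standard deviations rule, a hypothetical `RangeRule` class, made-up sample values), not a description of any particular product; a real system would use richer pattern analysis, but the feedback structure is the same.

```python
import statistics


class RangeRule:
    """Flags values outside mean +/- k standard deviations of accepted history."""

    def __init__(self, history, k=3.0):
        self.history = list(history)
        self.k = k
        self._refit()

    def _refit(self):
        # Recompute the acceptable range from everything accepted so far.
        mean = statistics.mean(self.history)
        spread = statistics.pstdev(self.history)
        self.low = mean - self.k * spread
        self.high = mean + self.k * spread

    def is_valid(self, value):
        return self.low <= value <= self.high

    def process_batch(self, batch):
        """Return rejected values, then refine the bounds from the accepted ones."""
        accepted = [v for v in batch if self.is_valid(v)]
        rejected = [v for v in batch if not self.is_valid(v)]
        self.history.extend(accepted)  # closing the loop: accepted data updates the rule
        self._refit()
        return rejected


rule = RangeRule([10, 11, 9, 10, 10, 11, 9, 10])
old_high = rule.high
rejected = rule.process_batch([10, 12, 50])
print(rejected, rule.high > old_high)  # prints [50] True
```

The value 50 is rejected and never influences the rule, while the accepted value 12 nudges the upper bound upward, so the rule adapts to legitimate drift without being contaminated by the outlier.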