The Biggest Challenge of Hadoop Analytics: It’s all about Query Performance
As Big Data gets bigger and more complex, scalability and performance become major areas of concern for business users. When organizations use SQL on Hadoop for their business intelligence, they often find it difficult to keep up with the growing needs of business users who expect instant answers to complex queries.
Forrester estimates that more than 20% of Big Data projects fail every year. So, why do these projects fail?
Business users need their answers and they need them fast. Historically, that meant two technologies were put in place: an Enterprise Data Warehouse (EDW) that stored vast amounts of data and a SQL engine that “power users” could use to access the data. In most cases, these data warehouses were difficult to modify when needed and, as a result, they became a “black box” to the data consumer. The SQL solution gave the power user access to the underlying data, but it was potentially “hazardous”: what if the user runs an “endless” query, or runs the wrong query on the wrong data and gets the wrong answer?
Why Does Hadoop Analytics Get Difficult?
The promise of Big Data is performance and storage. Therefore, IT groups see Big Data as the answer to the EDW/SQL conundrum. IT creates a Hadoop platform and opens the data up to everyone. Data analytics tools can connect directly to Hadoop, or users can write SQL against it. And because the Hadoop cluster has “infinite storage and processing”, it is expected to provide answers to all questions.
Unfortunately, analytics using SQL on Hadoop is not able to meet these demands. Tools like Hive and Impala can take minutes or hours to return the results of a query. Moreover, connecting data analytics tools like Tableau, MicroStrategy or Excel directly to Hive or Impala exacerbates the problem. In short, business users don’t get the performance they demand and need in today’s world.
“Waiting for query” Impacts Cost
On average, business users and analysts spend 100 minutes every week waiting for their queries to return. In monetary terms, that’s $53 per week per person being wasted directly. That doesn’t include the time and money lost by the manager or executive waiting for the report. And across an entire enterprise, there could be a few thousand users or more waiting for their data.
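To make the scale concrete, the figures above can be turned into a rough calculation. This is only a back-of-the-envelope sketch: the 100 minutes and $53 come from the paragraph above, while the 2,000-user headcount and the 52 working weeks are illustrative assumptions, not figures from this post.

```python
# Figures quoted in the article.
MINUTES_WAITING_PER_WEEK = 100
COST_PER_PERSON_PER_WEEK = 53.0  # USD

# Illustrative assumptions (not from the article).
ASSUMED_USERS = 2000
ASSUMED_WEEKS_PER_YEAR = 52


def implied_hourly_rate(minutes, cost):
    """Hourly rate implied by losing `cost` dollars over `minutes` of waiting."""
    return cost / (minutes / 60)


def enterprise_cost(users, weeks=ASSUMED_WEEKS_PER_YEAR,
                    weekly_cost=COST_PER_PERSON_PER_WEEK):
    """Direct cost of waiting for `users` people over `weeks` weeks."""
    return users * weeks * weekly_cost


if __name__ == "__main__":
    rate = implied_hourly_rate(MINUTES_WAITING_PER_WEEK, COST_PER_PERSON_PER_WEEK)
    print(f"Implied hourly rate: ${rate:.2f}")
    print(f"Weekly cost for {ASSUMED_USERS} users: "
          f"${enterprise_cost(ASSUMED_USERS, weeks=1):,.0f}")
    print(f"Yearly cost for {ASSUMED_USERS} users: "
          f"${enterprise_cost(ASSUMED_USERS):,.0f}")
```

Under these assumptions the direct cost comes to roughly $5.5 million a year, before counting the managers and executives waiting on the reports.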
The Analyst’s Dilemma
Apart from wasting analysts’ costly time, slow query response also makes them reluctant to use Hadoop data for analytics.
- If it’s going to take 10 minutes to run that query on the new system, why should I move to Hadoop?
- If I’m going to lose capability and performance, do I really need more data on Hadoop?
- My manager accepted what I gave them in the past from SQL Server – it’s good enough.
These limitations not only hamper the adoption of Hadoop in enterprises, but also make it difficult for IT teams to show a return on their Hadoop investments.
The OLAP on Hadoop Solution
So how do we eliminate these bottlenecks?
Enterprises are turning to OLAP on Hadoop for analytics after discovering that Hive or Impala queries just don’t respond fast enough against billions of rows of data. OLAP cubes let you query your business data to gather those same answers. Cubes support complex calculations, trend analysis, and sophisticated data modeling.
Big Data solution providers are now offering OLAP on Hadoop solutions that allow users across the enterprise to query and analyze massive volumes of information in seconds. By structuring, calculating, and pre-aggregating Hadoop data, this technology achieves both the scalability and the performance that business users need. It solves scalability by keeping data in the Hadoop platform, which is where Big Data is supposed to be. The cubes are also stored on Hadoop. The performance challenge is addressed by pre-aggregating data in the cube. Thus, when a user asks for sales data by year, the total has already been computed and it’s just a matter of finding that intersection in the data. The improved query performance not only helps business users understand trends and identify problems but also lets them develop future strategies based on informed decisions.
Besides performance and scalability, users can easily conduct Hadoop analytics using their preferred BI and data analytics tools such as Tableau, Qlik, Excel, Power BI, MicroStrategy, Business Objects, and more. Not only that, but developers can connect to the cubes using standard libraries in Python, Java, and many other languages.
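The pre-aggregation idea can be illustrated with a toy example. The sketch below is not how any particular OLAP on Hadoop product is implemented; it is a minimal in-memory illustration using only the Python standard library, and the sample rows, the dimension names, and the `build_cube` and `query` helpers are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (year, region, product, sales).
ROWS = [
    (2015, "East", "Widget", 100.0),
    (2015, "West", "Widget", 150.0),
    (2016, "East", "Gadget", 200.0),
    (2016, "West", "Widget", 50.0),
]
DIMENSIONS = ("year", "region", "product")


def build_cube(rows, dimensions):
    """Pre-aggregate sales for every combination of dimension values."""
    cube = defaultdict(float)
    for row in rows:
        values = dict(zip(dimensions, row[:-1]))
        sales = row[-1]
        # Add this row's sales to every intersection it belongs to,
        # from the grand total () up to the full (year, region, product) cell.
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = tuple((d, values[d]) for d in dims)
                cube[key] += sales
    return dict(cube)


def query(cube, **filters):
    """Answer a query with one dictionary lookup instead of a scan of the rows."""
    key = tuple((d, filters[d]) for d in DIMENSIONS if d in filters)
    return cube.get(key, 0.0)


CUBE = build_cube(ROWS, DIMENSIONS)

if __name__ == "__main__":
    print(query(CUBE, year=2015))                 # sales by year
    print(query(CUBE, year=2016, region="West"))  # sales by year and region
    print(query(CUBE))                            # grand total
```

The cost of scanning the rows is paid once, when the cube is built. Each later query for sales by year, or by year and region, is a single lookup whose speed does not depend on the number of underlying rows. The trade-off is the extra storage and build time for the pre-computed intersections.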
End the “Wait”
As enterprises generate new ideas to improve and grow, what they need is a Hadoop analytics platform that has the power to shorten the time between an idea for analytics and the delivery of the resulting insights.
This post is by Dhvani Shah, Manager of Marketing at Kyvos Insights. She started her career working for brands like Airbus and Philips before realizing her passion for Data and Analytics. She has been part of the Kyvos marketing team since its inception and has extensive knowledge of the Big Data industry.