Hadoop: 5 Undeniable Truths
Everyone knows sensationalist headlines can be distracting or inaccurate. The real problem with such overblown headlines is this: superficial debates are keeping Hadoop, big data, and the evolution of traditional databases from reaching their true potential.
At Qubole we often receive calls from potential customers who are confused about Hadoop’s capabilities. They believe that Hadoop is the savior for their newest analytics project and that it can replace all the functions of their existing data warehouses. Confusion at such a basic level leads to a poor customer experience, wasted dollars, and headaches, all of which could be avoided with better education.
So let’s set the record straight. Here are 5 simple truths about Hadoop:
1. You still need a traditional data warehouse. Traditional data warehouses allow for high-fidelity data and subsequent analysis, which are ultimately fundamental to businesses. Data warehouses make powerful use of structured, relational data, whereas Hadoop excels at managing unstructured, semi-structured or log data that classic data warehouses can’t handle well. The two make an attractive odd couple.
2. Hadoop isn’t great at real-time analytics. Hadoop is a great fit for staging vast amounts of raw data in order to extract summaries that can then be loaded into traditional enterprise data warehouses to conduct low-latency analytics. Real-time analytics, while making great advances on Hadoop with tools such as Presto and Apache Spark, are still best served by the traditional databases.
3. A Hadoop-only strategy is dangerous. Why would you need a Hadoop solution to process 10 GB of highly structured data? Yet we run into customers wanting to use Hadoop for exactly those small-scale needs. Sacrificing traditional data warehouses and relying solely on Hadoop is a dangerous move. A traditional database is still a necessity for managing day-to-day business operations, and the majority of businesses simply don’t have the resources or expertise needed to run a Hadoop cluster for every data query. Given how critical it is for a Hadoop initiative to prove initial return quickly, attempting to use the platform in ways it is not intended to be used will create disillusionment toward Hadoop and its true capabilities.
4. Hadoop is difficult to use. Praise for Hadoop and the promise of big data have created a magical haze around the technology that can mask its complexity. An investment in Hadoop requires an investment in a cluster management team in addition to infrastructure. Many Hadoop users come to us or another managed service provider because they don’t have the capacity to manage their Hadoop clusters or scale up to meet their customers’ demands. With a limited budget, the question comes down to hiring new talent and investing in additional clusters, or denying requests.
5. You don’t need the whole ecosystem. If your organization is actively involved in the open source community, you probably already know that you don’t need to use the entire zoo of Hadoop tools. However, those on the business end frequently misunderstand the purpose of the tools and don’t know that their business probably requires only one or two engines. For example, we would advise those looking for a SQL-on-Hadoop tool for data exploration to turn to Presto or a similar interactive engine rather than Hive, since Hive doesn’t offer interactive speeds.
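To make the division of labor from truth 2 concrete, here is a minimal sketch of the pattern: raw data is reduced to a compact summary in a batch step, and the summary is loaded into a relational store for fast queries. Everything in it is an assumption for illustration. The log lines are made up, the function names (`summarize`, `load_summary`, `hits_for`) are hypothetical, an in-memory SQLite database stands in for the data warehouse, and plain Python stands in for a real Hadoop job.

```python
import sqlite3
from collections import Counter

# Hypothetical raw log lines of the kind that would be staged in Hadoop.
RAW_LOGS = [
    "2014-06-01 GET /home 200",
    "2014-06-01 GET /pricing 200",
    "2014-06-01 GET /home 500",
    "2014-06-02 GET /home 200",
]

def summarize(lines):
    """Batch step: reduce raw log lines to (day, path) -> hit count."""
    counts = Counter()
    for line in lines:
        day, _method, path, _status = line.split()
        counts[(day, path)] += 1
    return counts

def load_summary(conn, counts):
    """Load step: write the compact summary into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_hits (day TEXT, path TEXT, hits INTEGER)"
    )
    conn.executemany(
        "INSERT INTO daily_hits VALUES (?, ?, ?)",
        [(day, path, hits) for (day, path), hits in sorted(counts.items())],
    )
    conn.commit()

def hits_for(conn, path):
    """Query step: answer from the small summary table, not the raw logs."""
    rows = conn.execute(
        "SELECT day, hits FROM daily_hits WHERE path = ? ORDER BY day", (path,)
    )
    return rows.fetchall()

# SQLite in memory is only a stand-in for a traditional data warehouse.
conn = sqlite3.connect(":memory:")
load_summary(conn, summarize(RAW_LOGS))
print(hits_for(conn, "/home"))
```

The point of the sketch is the shape of the pipeline, not the tools: the expensive pass over raw data happens once, and day-to-day questions are answered from the much smaller summary table.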
The Hadoop community must make greater efforts to educate users and correct misinformation about Hadoop’s capabilities. It’s ironic that in the information age, finding accurate and comprehensive information is so difficult and that some of the most helpful conversations continue to take place offline or in forums like Quora. With such a large informational hole to fill, why shouldn’t the industry that’s rapidly changing the way businesses think about data also change the way we market that technology? To start, let’s stop marginalizing other technologies and start playing nicely with others in the field.