Data locality is about making sure a big data set is stored near the compute that performs the analytics. For Hadoop, that means managing DataNodes that provide storage for MapReduce to perform adequately. It works effectively, but it leads to a separate operational issue: islands of big data storage. Here are some tips on how to manage big data storage in a Hadoop environment.

  1. Decentralize Storage

Centralized storage has been the traditional approach for some time now. But big data is not really suited to a centralized storage architecture. Hadoop was designed to move computing closer to the data while making use of the massive scale-out capabilities of the HDFS file system, advised Senthil Rajamanickam, FSI Strategy and Operations Manager at Infogix.

The common approach to solving the inefficiencies of Hadoop managing its own data, however, has been to store Hadoop data on a SAN. But that creates its own performance and scale bottlenecks. All of your data is then processed through centralized SAN controllers, which defeats the distributed, parallelized nature of Hadoop. You either have to manage multiple SANs for different sets of DataNodes, or point all DataNodes at a single SAN.

  2. Hyperconverged vs. Distributed

Be careful, though, not to confuse hyperconverged with distributed. Certain hyperconverged approaches are distributed, but typically the term means your application and storage will be co-resident on the same compute node. That’s tempting as a way to solve the data locality issue, but it can create too much resource contention, with the Hadoop application and the storage platform contending for the same memory and CPU. It’s better to run Hadoop on a dedicated application tier and run your distributed storage in a dedicated storage tier, taking advantage of caching and tiering to address data locality and network performance penalties, said Avinash Lakshman, CEO of Hedvig.

  3. Avoid Controller Choke Points

Lakshman stressed an important aspect of achieving this: avoid processing data through a single point (or even a dual point) such as a traditional storage controller. By instead making sure the storage platform is parallelized, performance can be improved dramatically.

In addition, this approach offers incremental scalability. Adding capacity to the data lake is as easy as adding a few x86 servers with flash or spinning disks in them. A distributed storage platform will automatically add the capacity and rebalance the data as necessary.
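On plain HDFS, that rebalancing is an explicit administrative step rather than something the platform does for you. The following is a minimal sketch of triggering it from Python; it assumes the `hdfs` CLI is on the PATH and is run with HDFS administrator privileges.

```python
# Minimal sketch: trigger an HDFS rebalance after new DataNodes are added.
# Assumes the `hdfs` CLI is on the PATH and the caller has admin rights.
import subprocess


def rebalance_hdfs(threshold_pct: int = 10) -> int:
    """Run the HDFS balancer until each DataNode's utilization is within
    `threshold_pct` percentage points of the cluster average."""
    result = subprocess.run(
        ["hdfs", "balancer", "-threshold", str(threshold_pct)],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    return result.returncode


if __name__ == "__main__":
    rebalance_hdfs(10)
```

Distributed storage platforms of the kind described above perform this redistribution automatically; the sketch simply shows what the equivalent step looks like when HDFS itself holds the data.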

  4. Deduplication and Compression

A key part of staying on top of big data is deduplication and compression. Hedvig is seeing 70% to 90% data reduction for common big data sets. At petabyte scale, that can mean tens of thousands of dollars in disk costs.

“Modern platforms provide inline (as opposed to post-processing) deduplication and compression,” said Lakshman. “That means the data never hits disk without first being reduced in some way, greatly decreasing the capacity needed to store data.”
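Inline deduplication of that kind is a storage-platform feature, but compression can also be applied from the application side before data lands in HDFS. Below is a minimal PySpark sketch of that complementary approach; it assumes a running Spark session, and the HDFS paths are hypothetical.

```python
# Minimal sketch: write a dataset in a compressed, columnar format so it is
# reduced before it reaches the DataNodes. Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compressed-write").getOrCreate()

# Hypothetical raw landing zone of JSON events.
events = spark.read.json("hdfs:///raw/events/")

(events.write
    .mode("overwrite")
    .option("compression", "snappy")          # compress each Parquet file
    .parquet("hdfs:///lake/events_parquet/")) # hypothetical curated target
```

Columnar formats such as Parquet combined with a codec like Snappy typically shrink raw JSON or CSV considerably before it is ever stored, which works alongside whatever reduction the storage layer provides.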

  5. Consolidate Hadoop Distributions

Many large organizations have multiple Hadoop distributions. It may be that developers need access to multiple “flavors,” or that business units have adopted different versions over time. Regardless, IT often ends up owning the ongoing maintenance and operations of these clusters. When big data volumes really begin to impact a business, the presence of multiple Hadoop distributions introduces inefficiency.

  6. Virtualize Hadoop

Virtualization has taken the enterprise world by storm. In many environments, well over 80% of physical servers are now virtualized. Yet many organizations have avoided virtualizing Hadoop due to performance and data locality concerns.

  7. Build an Elastic Data Lake

It isn’t easy to build a data lake, but the realities of big data storage will probably demand one. There are many ways to go about it, but which is the right one? The right architecture should lead to an active, elastic data lake that can store data from all sources and in multiple formats (structured, unstructured, semi-structured). More importantly, it must support the execution of applications right at the data source, not from a remote location that requires data movement.

“The ideal data lake infrastructure will enable the storage of a single copy of data, and have applications execute on the single data source without having to move data or make copies (for example, between Linux, VMs and Hadoop),” said Fred Oh, Senior Product Marketing Manager, Big Data Analytics, Hitachi.
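As an illustration of that principle, here is a minimal PySpark sketch that reads structured, semi-structured and unstructured data in place from the lake rather than copying it into separate systems. All paths and dataset names are hypothetical.

```python
# Minimal sketch: analyze data where it lives in the lake, in several formats,
# without making copies elsewhere. Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-place-analytics").getOrCreate()

orders = spark.read.parquet("hdfs:///lake/structured/orders/")      # structured
logs   = spark.read.json("hdfs:///lake/semistructured/app_logs/")   # semi-structured
notes  = spark.read.text("hdfs:///lake/unstructured/call_notes/")   # unstructured text

# The computation is shipped to the data; no datasets are exported or duplicated.
print(orders.count(), logs.count(), notes.count())
```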

  8. Integrate Analytics

Analytics is not a new capability, having existed in traditional RDBMS environments for many years. What is different is the advent of open source-based applications and the ability to integrate database tables with social media and unstructured data sources (e.g., Wikipedia). The key is the ability to integrate the multiple data types and formats into one standard so that visualization and reporting can be done more easily and consistently. Having the right tool set to accomplish this is vital to the success of any analytics/business intelligence project.
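A minimal PySpark sketch of that kind of integration follows: a structured warehouse table is joined with a semi-structured social media feed and reduced to one common shape suitable for reporting. The table, path and column names are hypothetical.

```python
# Minimal sketch: combine a structured table with a semi-structured social
# feed into one standard shape for reporting. Names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
    .appName("integrated-analytics")
    .enableHiveSupport()
    .getOrCreate())

customers = spark.table("warehouse.customers")               # structured, RDBMS-style table
tweets    = spark.read.json("hdfs:///lake/social/twitter/")   # semi-structured social feed

# Normalize the social feed to the columns the report needs.
tweet_mentions = tweets.select(
    F.col("user.screen_name").alias("handle"),
    F.col("text"),
)

# Join both sources and aggregate into a single reporting table.
report = (tweet_mentions
    .join(customers, tweet_mentions.handle == customers.twitter_handle)
    .groupBy("customer_segment")
    .agg(F.count("*").alias("mention_count")))

report.show()
```

Once both sources share one schema, the same visualization and reporting tools can be pointed at the result instead of at each raw feed separately.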

  9. Big Data Meets Big Video

Big data is bad enough. But an emerging strain of this phenomenon is big video. For example, enterprises increasingly use video monitoring not only for security, but also for operational and industrial efficiency, traffic management, regulatory compliance and several other use cases. Very soon, these sources will generate enormous amounts of content. Those having to deal with it had better make sure they establish the right kind of data store for it, Hadoop-based or otherwise.

  10. No Winner

Hadoop has certainly gained a lot of ground of late. So will it be the ultimate winner, besting all other approaches as big data storage volumes mushroom? Not likely.

Traditional SAN-based architectures, for example, will not be replaced in the near term because of their inherent strengths for OLTP and 100% availability requirements. But when analytics and integration with unstructured data are required (e.g., social media), there is a compelling argument for evaluating hyperconverged platforms that incorporate server compute, distributed file systems, Hadoop/Spark, and newer database applications with open source-based analytics tools.

The best approaches, therefore, incorporate hyperconverged platforms with a distributed file system, integrated with analytics software. Traditional Linux-based RDBMS applications (data warehouses, data marts, etc.) serve their purpose, Hadoop/Spark/MapReduce serves new social media challenges, and server virtualization provides flexibility and efficiency. But each of these environments can create its own data silo. The ideal approach will support all three simultaneously, add the ability to execute applications at the data source, and reduce data movement in the analytics workflow.