hadoopChoosing the right Hadoop distribution can be a tricky process. Many businesses looking to adopt Hadoop in their data infrastructure have a hard time figuring out what really differentiates one distribution from another. With so many options available, it’s easy to get lost in the choices.

There are four basic categories that businesses should look at for specific qualifying criteria. Let’s go through these four categories and see what considerations need to be made in order to sift through the options.

  1. Performance

Hadoop is the data platform chosen by many because it provides high performance – especially if you replace MapReduce in the Hadoop stack with the Apache Spark data processing engine.

But not all distributions carry a comparable level of performance, even with Apache Spark. Certain distributions can scale much easier (with less hardware), handle far larger quantities of data, and maintain higher levels of performance than others. So what should you look for as you evaluate each Hadoop distribution?

Consistent Low Latency

Many distributions will advertise high performance with low latency. The dirty little secret however is that quite often, this latency is unpredictable. The volatility of this performance quality can be staggeringly erratic. If you set up a POC, make sure to keep this in mind.

Distributed Metadata

A determining factor of a Hadoop distribution’s ability to perform at an enterprise level is how their file system architecture is constructed. One key element of the file system to pay attention to is how they manage your metadata. The most flexible, scalable and dependable versions of Hadoop file systems distribute the metadata among the nodes. This can provide your business with upwards of 20x the performance, and allows for high availability functionality for those working with mission-critical applications.

  1. Dependability

Considering that data has become the lifeblood of many businesses today, it’s a shame that dependability hasn’t been a greater priority for data management providers. When looking for a distribution, dependability is a significant differentiator. Very few implementations provide the following dependability features, all of which should be mandatory in your list of priorities.

High Availability

High availability with Hadoop distributions is a rare feature. Only a few implementations can guarantee a system availability of 99.999%. In order to ensure that your provider is giving you real HA functionalities, make sure that they have the following six characteristics:

  • Self-Healing – No human intervention required to immediately solve a system failure
  • No Downtime Upon Failure – The system keeps running no matter what
  • Tolerate Multiple Failures – The system’s failure management capacity can be controlled by administrator preferences
  • 100% Commodity Hardware – No commercial NAS required
  • No Additional HA Hardware Required – System should run HA capabilities on standard commodity hardware
  • Easy to Use – HA should be built in
  • Data Protection

Protecting your data with Hadoop can be done in many ways. The most advanced distributions depend on Snapshot recovery systems. These Point-In-Time Snapshots should utilize the same storage as your live data in order to prevent slowing down your system. Additionally, they should capture data that is both open and closed. Some versions of Hadoop Snapshot systems will only capture data that is closed; this jeopardizes the integrity of your backup data.

Disaster Recovery

Mirroring technology is the preferred method of disaster recovery for enterprise-level Hadoop users. With mirroring, your system should be able to automatically recover from a catastrophic system failure before it’s even noticed.

  1. Manageability

Hadoop has evolved into a user-friendly data management system. Different implementations have done their part to optimize Hadoop’s manageability through different administrative tools. Look for a distribution that has intuitive administrative tools that assist in management, troubleshooting, job placement and monitoring.

  1. Data Access

Gathering and storing your data is just the beginning of what Hadoop is all about. In order to tap into all of your data’s valuable insights, it’s important that it be easily and securely accessible. This starts with some key architectural data access foundations:

  • Full access to the Hadoop filesystem API
  • Full POSIX read/write/update access to files
  • Direct developer control over key resources
  • Secure, enterprise-grade search
  • Comprehensive data access tooling (e.g. Apache Flume, Apache Sqoop, Hive)

In addition to these foundational characteristics, look for security options that come enabled straight out of the box. Security features for Hadoop are commonly left available but unused by many administrators because their implementation has neglected to enable them with their Hadoop offering. The cost and time necessary to maximize security features can be daunting. The best option is to simply choose a distribution that has done the heavy lifting for you. Source