Data Lake Showdown: Object Store or HDFS?
The explosion of data is causing people to rethink their long-term storage strategies. Most agree that distributed systems, one way or another, will be involved. But when it comes to picking that distributed system, be it a file-based system like HDFS or an object store such as Amazon S3, the agreement ends and the debate begins.
The Hadoop Distributed File System (HDFS) has emerged as a top contender for building a data lake. The scalability, reliability, and cost-effectiveness of Hadoop make it a good place to land data before you know exactly what value it holds. Combine that with the ecosystem growing around Hadoop and the rich tapestry of analytic tools that are available, and it’s not hard to see why many organizations are looking at Hadoop as a long-term answer for their big data storage and processing needs.
At the other end of the spectrum are today’s modern object storage systems, which can also scale out on commodity hardware and deliver storage costs measured in the cents-per-gigabyte range. Many large Web-scale companies, including Amazon, Google, and Facebook, rely on object stores to efficiently store petabytes of unstructured data spanning trillions of objects.
But where do you use HDFS and where do you use object stores? In what situations will one approach be better than the other? We’ll break down where each approach fits and the benefits touted by each camp.
Why You Should Use Object-Based Storage
According to the folks at Storiant, a provider of object-based storage software, object stores are gaining ground among large companies in highly regulated industries that need greater assurances that no data will be lost.
“They’re looking at Hadoop to analyze the data, but they’re not looking at it as a way to store it long term,” says John Hogan, Storiant’s vice president of engineering and product management. “Hadoop is designed to pour through a large data set that you’ve spread out across a lot of compute. But it doesn’t have the reliability, compliance, and power attributes that make it appropriate to store it in the data lake for the long term.”
Object-based storage systems such as Storiant’s offer superior long-term data storage reliability compared to Hadoop for several reasons, Hogan says. For starters, they use a technique called erasure coding, which splits data into fragments plus parity pieces and spreads them across any number of commodity disks. Object stores like Storiant’s also build spare drives into their architectures to handle unexpected drive failures, and rely on the erasure coding to automatically rebuild the data volumes upon failure.
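A rough way to picture erasure coding is single-parity striping: an object is split into data shards, a parity shard is computed from them, and each shard lands on a different disk, so any one lost shard can be rebuilt from the survivors. The sketch below is a simplified illustration under that single-parity assumption; real systems such as Storiant’s use stronger schemes (e.g. Reed–Solomon) that tolerate multiple concurrent disk failures, and the function names here are ours, not any vendor’s API.

```python
# Minimal sketch of erasure coding with a single XOR parity shard.
# Shard counts and helper names are illustrative only.

def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-size shards plus one XOR parity shard."""
    shard_len = -(-len(data) // k)                  # ceiling division
    padded = data.ljust(k * shard_len, b"\x00")     # pad so shards line up
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = shards[0]
    for shard in shards[1:]:
        parity = bytes(x ^ y for x, y in zip(parity, shard))
    return shards + [parity]

def recover(shards: list[bytes], lost: int) -> bytes:
    """Rebuild the shard at index `lost` by XOR-ing the surviving shards."""
    survivors = [s for i, s in enumerate(shards) if i != lost]
    rebuilt = survivors[0]
    for shard in survivors[1:]:
        rebuilt = bytes(x ^ y for x, y in zip(rebuilt, shard))
    return rebuilt

# Spread 4 data shards + 1 parity shard across 5 "disks"; lose one, rebuild it.
pieces = encode(b"object store payload", k=4)
assert recover(pieces, lost=2) == pieces[2]
```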
If you use Hadoop’s default setting, everything is stored three times, which delivers five 9s of reliability, which used to be the gold standard for enterprise computing. Hortonworks architect Arun Murthy, who helped develop Hadoop while at Yahoo, pointed out at the recent Hadoop Summit that storing everything only twice in HDFS takes one 9 off the reliability, giving you four 9s. That certainly sounds good.
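Murthy’s point is essentially multiplicative: under a naive model where each replica is lost independently with some probability p before HDFS can re-replicate the block, data disappears only when every copy is gone, so each replica you drop strips a factor of 1/p from your durability. The sketch below uses an arbitrary p just to show the shape of that argument; it ignores correlated failures and re-replication speed, so the absolute nines are not meant to match Murthy’s figures.

```python
# Naive independent-failure model: a block is lost only if every replica
# fails before HDFS re-replicates it. The probability p is a stand-in,
# not a measured figure, so only the gap between the two rows matters.
import math

p = 0.01  # hypothetical chance of losing one replica during the at-risk window

for replicas in (2, 3):
    loss = p ** replicas            # all copies of the block are gone
    nines = -math.log10(loss)       # rough "number of 9s" of durability
    print(f"{replicas}x replication: P(loss) = {loss:.0e}  (~{nines:.0f} nines)")

# 2x replication: P(loss) = 1e-04  (~4 nines)
# 3x replication: P(loss) = 1e-06  (~6 nines)
```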