MongoDB, Cassandra, and HBase: the three NoSQL databases to watch
Hadoop gets much of the big data credit, but the reality is that NoSQL databases are far more broadly deployed — and far more broadly developed. In fact, while shopping for a Hadoop vendor is relatively straightforward, picking a NoSQL database is anything but. There are, after all, in excess of 100 NoSQL databases, as the DB-Engines database popularity ranking shows.
Which should you choose?
Spoiled for choice
Because choose you must. As nice as it might be to live in a happy utopia of so-called polyglot persistence, “where any decent-sized enterprise will have a variety of different data storage technologies for different kinds of data,” as Martin Fowler argues, the reality is you can’t afford to invest in learning more than a few.
Fortunately, the choice is getting easier as the market coalesces around three dominant NoSQL databases: MongoDB (backed by my former employer), Cassandra (primarily developed by DataStax, though hatched at Facebook), and HBase (closely aligned with Hadoop and developed by the same community).
Note that I purposefully exclude Redis from this list. While a great data store, it’s primarily used for caching data and isn’t well suited for a wide array of workloads.
LinkedIn data from 451 Research shows how the market is gravitating to MongoDB, Cassandra, and HBase.
That’s LinkedIn profile data. A more complete view is DB-Engines’, which aggregates jobs, search, and other data to understand database popularity. While Oracle, SQL Server, and MySQL reign supreme, MongoDB (no. 5), Cassandra (no. 9), and HBase (no. 15) are giving them a run for their money.
While it’s too soon to call every other NoSQL database a rounding error, we’re rapidly reaching that point, exactly as happened in the relational database market.
To better understand why these three databases shine, I asked representatives from each to identify key attributes for their success: Kelly Stirman, director of products at MongoDB; Patrick McFadin, chief Cassandra evangelist at DataStax; and Justin Kestelyn, senior director of developer relations at Cloudera.
But first, we need to understand why NoSQL matters.
A world built with unstructured data
We increasingly live in a world where data doesn’t fit nicely into the tidy rows and columns of an RDBMS. Mobile, social, and cloud computing have spawned a massive flood of data. According to a variety of estimates, 90 percent of the world’s data was created in the last two years, with Gartner pegging 80 percent of all enterprise data as unstructured. What’s more, unstructured data is growing at twice the rate of structured data.
As the world changes, data management requirements go beyond the effective scope of traditional relational databases. The first organizations to observe the need for alternative solutions were Web pioneers, government agencies, and companies that specialize in information services.
MongoDB: Of the developers, for the developers
Among the NoSQL options, MongoDB’s Stirman points out, MongoDB has aimed for a balanced approach suited to a wide variety of applications. While the functionality is close to that of a traditional relational database, MongoDB allows users to capitalize on the benefits of cloud infrastructure with its horizontal scalability and to easily work with the diverse data sets in use today thanks to its flexible data model.
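To make the "flexible data model" point concrete, here is a minimal sketch in plain Python. It does not use MongoDB or any driver; the `find` helper and the sample records are hypothetical and only illustrate the idea that documents in one collection need not share a fixed set of columns.

```python
# Toy illustration of a schema-flexible document collection.
# Plain Python only; this is NOT MongoDB's API, just the general idea.

def find(collection, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

# Documents in the same collection can carry different fields,
# including nested values and lists, without a schema migration.
users = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Grace", "phones": ["555-0100", "555-0101"], "city": "Arlington"},
    {"name": "Linus", "city": "Portland", "social": {"github": "linus"}},
]

in_portland = find(users, city="Portland")
print([doc["name"] for doc in in_portland])
```

In a relational schema, adding the `phones` or `social` fields would typically mean altering a table or adding a join table; here each record simply carries the fields it has.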
Cassandra: Safely run at scale
There are at least two kinds of database simplicity: development simplicity and operational simplicity. While MongoDB rightly gets credit for an easy out-of-the-box experience, Cassandra earns full marks for being easy to manage at scale.
As DataStax’s McFadin told me, users tend to gravitate to Cassandra the more they butt their heads against the difficulty of making relational databases faster and more reliable, particularly at scale. A former Oracle DBA, McFadin was elated to discover that “replication and linear scaling are primitives” with Cassandra, and the features were “the primary design goal from the beginning.”
In the RDBMS world, database features like scaling and replication are the hard parts left to the user. This worked fine in yesterday’s enterprise when scale wasn’t a big issue. Today it’s quickly becoming the issue.
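As a rough picture of what it means for replication and scaling to be primitives, the sketch below places nodes on a hash ring and stores each key on several consecutive nodes. It is a toy, loosely modeled on the token-ring idea, and not Cassandra's actual code; the `Ring` class, node names, and replica count are all assumptions made for illustration.

```python
import bisect
import hashlib


class Ring:
    """Toy hash ring: each key is stored on `replicas` distinct nodes."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # Each node sits on the ring at a position derived from its name.
        self._ring = sorted((self._hash(node), node) for node in nodes)

    @staticmethod
    def _hash(value):
        # Any stable hash works for the sketch; SHA-256 is used here.
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Scaling out is just inserting one more position on the ring.
        bisect.insort(self._ring, (self._hash(node), node))

    def owners(self, key):
        """Walk clockwise from the key's position, collecting replica nodes."""
        positions = [pos for pos, _ in self._ring]
        start = bisect.bisect_right(positions, self._hash(key))
        count = min(self.replicas, len(self._ring))
        return [
            self._ring[(start + i) % len(self._ring)][1] for i in range(count)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
print(ring.owners("user:42"))

ring.add_node("node-e")  # add capacity without touching application code
print(ring.owners("user:42"))
```

The point of the sketch is that the application never decides where a row lives or how many copies exist; the placement rule does, and adding a node changes only which slice of keys each node owns.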
HBase: Bosom buddies with Hadoop
HBase, which like Cassandra is a column-oriented key-value store, gets a lot of use in large part because of its common pedigree with Hadoop. Indeed, as Cloudera’s Kestelyn put it, “HBase provides a record-based storage layer that enables fast, random reads and writes to data, complementing Hadoop by emphasizing high throughput at the expense of low-latency I/O.”