This article is part of an Essential Guide, our editor-selected collection of articles, videos and other content on this topic. It appears in the guide's section on data management market and trend analysis by editor Jack Vaughan.
Cassandra, MongoDB, HBase -- they're just a few of the many NoSQL databases now proliferating. These databases look to solve one problem or another encountered by the steadfast relational database systems (RDBMSs) that have long ruled in the enterprise. But the very variety that makes the NoSQL sector so vibrant can make comparing different products a challenging -- and often fruitless -- proposition for would-be users.
Before looking more at that issue, it's reasonable to ask why any of these NoSQL things matter at all. The short answer is that large-scale distributed processing is taking hold in more applications, thus exposing some of the creaky flooring on which the RDBMS sits. In Web applications and enterprise apps alike, a common theme has been emerging: The relational database may not always be the best fit.
Examples of RDBMS misfits are common. The relational database can be too expensive to scale out across a widely distributed deployment. It doesn't easily adapt to new styles of data -- for example, the unstructured information that's common in big data applications. It struggles with the massive data volumes coming from in-the-field sensors or Web server activity logs.
As people have found more and more reasons to move work off of incumbent relational databases, what has emerged is a "fit for purpose" mentality of the kind that was a bit more prevalent in the days before the RDBMS became the all-purpose flour in the database server pantry. And the number of NoSQL database options developed to fit various purposes has grown greatly.
Searching for Cassandra
Apache Cassandra is a good example. Like some other NoSQL technologies, the Cassandra database came about because of a big Web 2.0 fish -- in this case, Facebook. The purpose for which Facebook created Cassandra was to enable users of the social network to search their inboxes. When the database was launched in 2008, it supported replication across geographically distributed data centers to quickly service the searches of as many as 100 million users.
Inside, Cassandra is a distributed key-value database that uses a row store scheme and a peer-to-peer (or shared-nothing) architecture. Its design incorporates some of the characteristics of Google Bigtable and Amazon Dynamo, two early and influential NoSQL databases. Along the way, Cassandra has added support for MapReduce, gained a query language and triggers, and refined its support for lightweight transactions and database compaction.
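For readers who want a feel for what "peer-to-peer" means in a Dynamo-style design, here is a minimal Python sketch of a hash ring in which every node is an equal and a key is placed on the first node clockwise from its hash, plus that node's successors as replicas. It is a toy under stated assumptions, not Cassandra's actual partitioner: the `Ring` class, the `token` helper, the use of MD5 and the node names are all hypothetical choices made for illustration.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy peer-to-peer ring: no master node, each node owns a slice of the key space."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Sort nodes by ring position so lookups can use binary search.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def owners(self, key):
        """Return the nodes holding `key`: the first node clockwise, then its successors."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.entries)
        count = min(self.replicas, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


# Hypothetical four-node cluster; each key is stored on two of the nodes.
ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=2)
print(ring.owners("inbox:user42"))
```

Because any node can compute `owners` for any key, no central coordinator is needed to route a request, which is the property the shared-nothing description points at.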
Facebook eventually replaced the Cassandra-based search system with a Hadoop and HBase implementation, but the company ceded the software to open source; a community arose to carry it forward, and Cassandra became a top-level Apache Software Foundation project in 2010.
Mapping to the problems
Cassandra represented a good fit for the needs of Internet Identity, according to Jason Atlas, vice president of technology and engineering at the Tacoma, Wash.-based security services company. Known as IID, the company had a rapidly growing database of IP addresses running on a MySQL RDBMS cluster. But for cost and other reasons, the MySQL path didn't seem tenable going forward.
IID was harvesting and collecting 600,000 unique IPv4 addresses and host names per week. Related metadata collections were also growing. "We started to see that we couldn't store more than 30 days of information at one time," Atlas said. "The problems largely revolved around scale." He added that the IPv4 data "lent itself to a key-value approach," which ultimately led IID to the DataStax Enterprise version of Cassandra.
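To illustrate why address data of this kind maps naturally onto a key-value model, the following Python sketch keys every record by its IPv4 address and answers lookups with a single-key read. It is an in-memory stand-in only and says nothing about IID's real schema or about the DataStax product: the `ObservationStore` class, its method names and the sample addresses and host names are hypothetical.

```python
import ipaddress
from collections import defaultdict
from datetime import datetime, timezone


class ObservationStore:
    """In-memory stand-in for a key-value table keyed by IPv4 address."""

    def __init__(self):
        self._rows = defaultdict(list)

    def put(self, ip, hostname, seen_at):
        # Validate and normalize the key; a malformed address raises ValueError.
        key = str(ipaddress.IPv4Address(ip))
        self._rows[key].append((seen_at, hostname))

    def get(self, ip):
        # Single-key lookup: no joins and no scan across other addresses.
        key = str(ipaddress.IPv4Address(ip))
        return sorted(self._rows.get(key, []))


# Sample data uses documentation-reserved addresses and example host names.
store = ObservationStore()
store.put("192.0.2.10", "mail.example.com", datetime(2014, 1, 5, tzinfo=timezone.utc))
store.put("192.0.2.10", "www.example.com", datetime(2014, 1, 2, tzinfo=timezone.utc))
print(store.get("192.0.2.10"))
```

Every read and write touches exactly one key, so the data can be spread across many machines by key without cross-node queries.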
Cassandra is built to run on commodity clusters, as might be expected given its Google-Amazon-Facebook lineage. Its focus on scalability bears fruit, in Atlas's estimation: He said it is "coming as close to linear scaling" as anything he has previously seen. He also gives points to DataStax for a Cassandra-MapReduce integration that he expects to use going forward.
But he cautioned those who are looking to embrace Cassandra or other NoSQL databases, offering a reminder that it is unwise to force-fit technologies onto problems. "It's always best to map the problem onto the solution," Atlas said.
How do I NoSQL? Let me count the ways
Sorting through the variety in the NoSQL space is nothing short of daunting. Some NoSQL vendors are becoming household names in database circles -- for example, DataStax and a quartet of other NoSQL database makers (Basho Technologies Inc., Couchbase Inc., MarkLogic Corp. and MongoDB Inc.) were listed among the top vendors of operational database management systems in a recent Gartner Inc. Magic Quadrant report. But there are dozens of NoSQL offerings in several distinct product categories -- and different databases in the same category were built to support different uses. It's all a bit of a maze to navigate.
I caught up with Gartner analyst Merv Adrian on this issue in the Twittersphere. In a tweet, he had pointed to a Linux Journal reader poll comparing NoSQL databases. Adrian deadpanned: "In related news -- do you prefer apples, cocktails or broccoli?" While rolling on the floor laughing, I tweeted him that I thought I understood his point. He tweeted back: "It's useless -- and meaningless -- to compare 'NoSQL' products that are so wildly different in structure and intent."
Atlas made a similar point. "Mongo and Cassandra have nothing to do with one another, but are still both called 'NoSQL.' Their use cases are very different," he said.
Ultimately, we should expect some thinning of the NoSQL ranks. Cassandra is showing signs that it could be one of the survivors. But despite being fit for some specific purposes, it and others under the NoSQL umbrella may need to find more general uses to truly thrive.