How to select the best DBMS software: A buyer's guide
A collection of articles that takes you from defining technology needs to purchasing options
Reviewing NoSQL database management system (DBMS) offerings is a more difficult task than comparing and contrasting relational DBMSes, as there's more than one type of NoSQL database and a large number of individual NoSQL DBMSes.
Part of the process of vetting a NoSQL DBMS is to narrow down the choices to a manageable number. There are more than 100 NoSQL DBMSes listed on the NoSQL-org.com website, and it isn't possible to review every candidate. Here, we examine several leading database products in each of the four main NoSQL categories -- document store, key-value, graph database and column family store -- and look at the considerations involved in evaluating the practicality of each.
Before going down the NoSQL path, it's a good idea to first determine whether your existing DBMS software can be used. NoSQL is a hot industry trend right now, and sometimes technicians get excited about new technologies. However, in some cases, the tried-and-true technology already in place at your site -- like a relational DBMS -- is a better choice. So make sure you understand the problem you're trying to solve, and that NoSQL is the proper fit.
Document store DBMSes
Document databases work well for event logging, online shopping, content management and in-depth analytical processing. The schema flexibility of document databases can also be useful for projects requiring rapid prototyping.
One of the leading NoSQL DBMSes is MongoDB, an open source document store DBMS. It's designed to make it easy to develop and run modern applications that rely on structured and unstructured data while delivering scalability and high availability, and supporting rapidly changing data. Due to the popularity of MongoDB, there are probably more technicians familiar with it than any other NoSQL DBMS, making it somewhat easier to staff MongoDB projects.
MongoDB stores data as documents in a binary representation called Binary JSON (BSON), which extends JSON with additional data types. MongoDB is specifically designed for rapidly building applications that scale globally and are inexpensive to operate. However, data consistency can be an issue with MongoDB: if read operations are allowed on secondary nodes, only eventual consistency is guaranteed for those reads.
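To make the consistency caveat concrete, here is a minimal, self-contained Python sketch of what eventual consistency on secondary reads means. It is a toy model only: the `ReplicaSet` class, its method names and the sample key are hypothetical and do not use or represent MongoDB's API or its actual replication mechanism.

```python
class ReplicaSet:
    """Toy primary/secondary pair with asynchronous replication (illustration only)."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.pending = []  # writes accepted by the primary but not yet replicated

    def write(self, key, value):
        # Writes always go to the primary and are queued for the secondary.
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Apply the queued writes to the secondary, in order.
        for key, value in self.pending:
            self.secondary[key] = value
        self.pending.clear()

    def read(self, key, from_secondary=False):
        node = self.secondary if from_secondary else self.primary
        return node.get(key)


rs = ReplicaSet()
rs.write("cart:42", ["book"])
stale = rs.read("cart:42", from_secondary=True)      # replication has not run yet
fresh = rs.read("cart:42")                           # the primary already has the write
rs.replicate()
caught_up = rs.read("cart:42", from_secondary=True)  # the secondary has now converged
print(stale, fresh, caught_up)
```

The read from the secondary misses the write until replication runs, then converges. An application that can't tolerate that window would need to read from the primary.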
Another option is Couchbase Server, a JSON-based document store derived from CouchDB, which is an Apache open source project. As with most NoSQL offerings, Couchbase Server delivers eventual consistency for transactions rather than ACID (atomicity, consistency, isolation and durability) guarantees.
A strength of Couchbase Server is its Web administration user interfaces, which provide metrics per cluster, per node and per bucket. Many NoSQL offerings rely on command line interface (CLI) administration, but Couchbase Server administration tasks can be performed using the Web, CLI or RESTful API.
Another option is MarkLogic Server, an enterprise document database platform. MarkLogic Server is commercially licensed and supported by the vendor, MarkLogic.
MarkLogic Server can handle JSON, XML and resource description framework (RDF) data natively, and offers critical enterprise features such as ACID transactions, automated failover and security.
Other document store offerings exist with specific niches, such as RavenDB, an open source document store with strong .NET features; Pivotal GemFire, with in-memory design; and Apache Jena, a Java framework for building semantic Web applications using RDF.
Key-value DBMSes
A key-value database is ideal when most access to data is by key, which is a unique identifier for some item of data. The key-value approach is somewhat similar to the document approach in that both offer flexible schemata, but the data in a key-value store isn't structured using a format like JSON. Instead, the key-value database uses a key to retrieve an opaque value, and the content of that value can vary from record to record.
The main difference is that a document store embeds metadata associated with the content, enabling the user to query the data based on its contents, whereas a key-value store retrieves data only by key. Key-value databases excel at session management, serving ad content and managing user or product profiles. When data is encoded in many different ways without a rigorous schema, using a key-value database can make sense.
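The distinction between the two models can be sketched in a few lines of Python. This is an illustration built on plain dictionaries, not any product's API; the store contents and the helper names `kv_get` and `doc_find` are hypothetical.

```python
import json

# Key-value store: each value is an opaque blob; the store only understands the key.
kv_store = {
    "session:1001": b'{"user": "ana", "cart": ["book"]}',
    "session:1002": b"\x00\x01binary-profile-data",
}

# Document store: each value is structured, so the store can inspect its fields.
doc_store = {
    "p1": {"type": "product", "name": "lamp", "price": 30},
    "p2": {"type": "product", "name": "desk", "price": 120},
    "u1": {"type": "user", "name": "ana"},
}


def kv_get(key):
    """Key-value access: the only supported lookup is by key."""
    return kv_store.get(key)


def doc_find(**criteria):
    """Document access: return ids of documents whose fields match all criteria."""
    return sorted(
        doc_id
        for doc_id, doc in doc_store.items()
        if all(doc.get(field) == value for field, value in criteria.items())
    )


# The application, not the store, decides how to decode a key-value blob.
session = json.loads(kv_get("session:1001"))
products = doc_find(type="product")
cheap = doc_find(type="product", price=30)
print(session["user"], products, cheap)
```

In the key-value case, finding every session that contains a given cart item would mean fetching and decoding every value in the application. The document-style query can filter on content directly.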
One of the leading key-value DBMSes is Redis, an open source, BSD-licensed, key-value data store. Redis is set up using a configuration file that contains parameters to specify the working directory and to control Redis' behavior. At its core, Redis is a key-value store, but it also supports different kinds of data structures. Whereas traditional key-value stores associate string keys with string values, in Redis the value isn't limited to a simple string but can also hold more complex data structures.
Another NoSQL key-value DBMS option is Riak from Basho Technologies. Riak is a fault-tolerant, highly available, scalable, distributed multimodel DBMS. Riak open source is free under the Apache 2 license whereas Riak Enterprise requires a commercial license agreement, sold by Basho Technologies.
Riak is engineered to support rapid development and ease of management. Although its heritage is key-value, Riak is more accurately termed a multimodel platform, supporting key-value, object store and search capabilities all from the same platform.
Riak is implemented across multiple servers and delivers high availability by maintaining normal functionality during hardware and network failures. Data is spread across nodes so that it's evenly distributed around the cluster, making it easy to add or remove nodes. Riak is a masterless system, which means any server can respond to read or write requests. If one server fails, other servers will continue to act upon client requests.
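One general technique for spreading keys evenly across a masterless cluster is consistent hashing. The Python sketch below illustrates the idea under simplifying assumptions; it is not Riak's implementation, and the `HashRing` class, the node names and the `vnodes` parameter are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring: each node owns many points on the ring."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        # Place several virtual points per node to even out the distribution.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [point for point in self._ring if point[1] != node]

    def owner(self, key):
        # The owner is the first point at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {key: ring.owner(key) for key in keys}

ring.remove("node-c")  # simulate taking one server out of the cluster
after = {key: ring.owner(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys changed owner")
```

Only the keys that belonged to the removed node are reassigned; keys on the surviving nodes stay where they were. That property is why membership changes in a design of this kind are relatively cheap.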
Another key-value option worth considering is Amazon DynamoDB, a cloud-based NoSQL DBMS that provides high performance with low latency. It may be of particular interest to SMBs looking for a database-as-a-service implementation that is billed based on capacity.
Column store DBMSes
A column store NoSQL DBMS allows you to store data with keys mapped to values and the values grouped into families that are often accessed together. A column database is well-suited for data where writes are uncommon and applications need to access a few columns of many rows all at once.
Column stores work well for event logging, content management and counting/categorizing for analytics. Column stores are also useful when you have expiring data because you can set up a column to automatically expire.
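The column family layout and the idea of expiring data can be sketched in Python as follows. This is a toy in-memory model, not the storage format or API of any column store; the `ColumnFamilyStore` class, its `ttl` parameter and the injected clock are assumptions made for illustration.

```python
import time


class ColumnFamilyStore:
    """Toy model: row key -> column family -> column -> (value, expiry time)."""

    def __init__(self, clock=time.time):
        self._rows = {}
        self._clock = clock  # injectable so expiry can be demonstrated without sleeping

    def put(self, row_key, family, column, value, ttl=None):
        # A column written with a ttl (in seconds) stops being visible once it elapses.
        expires_at = None if ttl is None else self._clock() + ttl
        families = self._rows.setdefault(row_key, {})
        families.setdefault(family, {})[column] = (value, expires_at)

    def get_family(self, row_key, family):
        # Read all live columns of one family together, skipping expired ones.
        now = self._clock()
        columns = self._rows.get(row_key, {}).get(family, {})
        return {
            name: value
            for name, (value, expires_at) in columns.items()
            if expires_at is None or expires_at > now
        }


fake_now = [1000.0]  # a controllable clock for the example
store = ColumnFamilyStore(clock=lambda: fake_now[0])
store.put("user:7", "profile", "name", "Ana")
store.put("user:7", "session", "token", "abc123", ttl=60)

live = store.get_family("user:7", "session")
fake_now[0] += 61  # advance the clock past the 60-second ttl
expired = store.get_family("user:7", "session")
profile = store.get_family("user:7", "profile")
print(live, expired, profile)
```

Grouping columns into families means a read of the session data never touches the profile columns. The expiring token disappears from reads without any cleanup code in the application.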
Apache Cassandra is one of the top NoSQL column family DBMSes. It's an open source DBMS, originally developed at Facebook and later released as an open source project, and is therefore freely available to download and use.
DataStax, a commercial vendor, has created an enterprise-level version of Cassandra with support called DataStax Enterprise. DataStax Enterprise is free to use in development environments; use in production requires the purchase of a license (or enrollment in the startup program). DataStax offers subscriptions for both production and non-production environments that include certified software and support.
Apache Cassandra is designed to be used by online applications that require fast performance with no downtime. It was engineered to handle very large amounts of data spread out across commodity servers to deliver high availability without a single point of failure.
Apache HBase is another leading open source NoSQL column store. Designed to deliver random, real-time, read/write access to large amounts of data using commodity hardware, HBase is modeled after Google's Bigtable storage system. It's built on top of Hadoop and Hadoop Distributed File System (HDFS).
As of this article's last update, the current version of Apache HBase is 1.3.
Although Hadoop and HBase are open source projects, there are commercial providers, too, such as Cloudera, which offers Cloudera Enterprise. At the core of Cloudera Enterprise is CDH, which combines Apache Hadoop and other open source projects into a single, highly scalable system for analytical processing. Of course, Cloudera isn't the only commercial provider; for example, Hortonworks and MapR Technologies are other leading providers of Hadoop distributions that include HBase.
Graph DBMSes
The graph database NoSQL category focuses on relationships between values and stores data using graph structures with nodes, edges and properties. In a graph database, every element contains a direct pointer to its adjacent elements, so no index lookups are necessary to traverse relationships.
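The idea of elements pointing directly at their neighbors can be sketched in Python. This is a conceptual illustration only, not how any graph DBMS stores data; the `Node` class, the `reachable` helper and the `FOLLOWS` label are hypothetical.

```python
from collections import deque


class Node:
    """A graph element that holds direct references to its neighbors."""

    def __init__(self, name, **properties):
        self.name = name
        self.properties = properties
        self.edges = []  # list of (label, Node) pairs

    def connect(self, label, other):
        self.edges.append((label, other))


def reachable(start, label, max_hops):
    """Names reachable from start within max_hops, following edges with the given label."""
    seen = {start.name}
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        # Each hop follows references stored on the node itself; no global index is consulted.
        for edge_label, neighbor in node.edges:
            if edge_label == label and neighbor.name not in seen:
                seen.add(neighbor.name)
                found.append(neighbor.name)
                queue.append((neighbor, hops + 1))
    return found


ana, ben, cy, dee = Node("ana"), Node("ben"), Node("cy"), Node("dee")
ana.connect("FOLLOWS", ben)
ben.connect("FOLLOWS", cy)
cy.connect("FOLLOWS", dee)

friends_of_friends = reachable(ana, "FOLLOWS", max_hops=2)
print(friends_of_friends)
```

The cost of each hop depends on how many edges the current node has, not on the total size of the data set. That is the appeal of this model for relationship-heavy queries.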
Before looking into a graph database provider, make sure your intended use works well with the graph model. Such use cases include social media (relationship management), search, network and IT operations, fraud detection, real-time recommendations, digital asset management and master data management -- essentially any application that benefits from harnessing the power of data relationships using graphs.
The leading graph database is Neo4j. Neo4j is a native graph database system in which things are stored as nodes, and the relationships between those things form the structure of the database. From the data model to the query language to the storage engine, relationships are built into the structure of every Neo4j database.
Neo4j offers ACID transactions and high-availability clustering for enterprise deployments, and it comes with a Web-based administration tool. Neo4j isn't new technology; the company has been in business for more than a decade.
Another popular graph DBMS choice is Titan, which is optimized for storing and querying graphs represented over a cluster of machines. This approach is good for workloads that will grow, as the cluster can elastically scale as the amount of data increases and the number of users expands. Titan has a pluggable storage architecture that allows it to build on proven database technology such as Apache Cassandra, Apache HBase or Oracle BerkeleyDB.
DataStax Enterprise, which is built on Apache Cassandra, delivers a multimodel DBMS platform with graph capabilities inspired by the open source Titan graph database. Choosing a multimodel approach can make sense for applications needing several different NoSQL approaches (such as key-value for some data and graph for other data).
Additionally, many RDBMS offerings can support graph data in the form of RDF triples, including IBM DB2, Microsoft SQL Server and Oracle.
Most NoSQL DBMS offerings are open source and, as such, can be licensed for free under an open source license or via a commercial license from a vendor that offers support and upgrades. The commercial option is recommended for organizations intending to use NoSQL databases in production applications and systems.
Another choice: The multimodel DBMS
Yet another choice in the NoSQL market is the multimodel DBMS. A growing number of vendors have delivered DBMS products that support more than one (or all) of the NoSQL models (and in some cases, relational, too). Examples of multimodel NoSQL products include DataStax Enterprise, FoundationDB, CortexDB and OrientDB.
Keep in mind, too, that your existing relational DBMS may also be an option. The relational vendors are working to expand their DBMSes to embrace NoSQL, and some have already started to introduce NoSQL capabilities.
One example is IBM DB2. The BLU Acceleration capability extends DB2 for Linux, Unix and Windows with a column store capability, albeit a relational column store. But DB2 also has the ability to store RDF graph triples and JSON documents, which may obviate the need for DB2 users to acquire a graph or document database.
Cutting through the NoSQL DBMS clutter
Overall, the NoSQL DBMS market is crowded and confusing. There's a lot of good technology available, as well as a plethora of vendors to consider. We've provided an overview of several leading NoSQL DBMSes, along with considerations for when it's practical to use them. Keep in mind, though, that a product isn't unworthy of consideration simply because it wasn't covered or mentioned here. Keeping your eye on what best fits the intended use will help you cut through the clutter in the crowded world of NoSQL.
About the author
Craig S. Mullins is a data management strategist, researcher, consultant and author with more than 30 years of experience in all facets of database systems development. He is president and principal consultant of Mullins Consulting Inc. and publisher/editor of TheDatabaseSite.com. Email him at firstname.lastname@example.org.
This article was updated in November 2016.