Trends are, by definition, changes in the status quo over time, so let’s start with a brief review of where we are and how we got here. In the really early days of computing, systems simply mixed the data and the applications together into a hopeless mess, so the first trend was to separate out three components from one another: the data, the database engine and the application.
Once that triad was established, we experimented with several ways of structuring data – the hierarchic and network approaches, for example – before Ted Codd and latterly Chris Date described and popularized the relational database model. Commercial products based on it emerged around 1980, and the relational database management system (RDBMS) has been the king of transactional systems since the mid-1990s. Today, most of our structured data is managed by relational engines.
The ‘90s also saw some pretenders to the throne, such as the object-oriented database. At one time, that technology was feted as a replacement for relational databases. But instead, the relational engines simply absorbed some of the “killer” capabilities of their object-oriented counterparts and successfully retained the transaction processing crown.
And if transaction processing was all that we asked databases to do, then the predominant DBMS trend for the future likely would be “nothing much will change.” But we ask database engines to perform many functions, and generally speaking, the more an engine is optimized for one task, the worse it is at another.
Most relational databases are optimized for processing transactions; for example, they’re often excellent at maintaining data integrity in the face of multiple user updates. Such features are highly desirable, of course. But a common side effect is that relational engines are slow at other tasks, such as running complex analytical queries. That particular effect is so severe that in the business intelligence (BI) world there is a tendency to completely separate transactional and analytical workloads. We have a set of relational engines that run the transactional systems (finance, HR, sales, etc.), and then we regularly copy the transaction data to a separate BI system for reporting and analysis.
There the data typically is restructured by dimensions and stored in a multidimensional database. Analytical queries run against the multidimensional engine like greased lightning. In essence, we’ve developed a way of having our cake and eating it, too.
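The restructuring described above can be sketched in a few lines. The code below is an illustrative toy, not how any particular multidimensional engine works internally: the table, dimension names and figures are all hypothetical. The point is simply that by grouping transaction rows along dimensions once, up front, an analytical question becomes a cheap lookup instead of a scan.

```python
from collections import defaultdict

# Hypothetical transaction rows copied across from an operational system:
# (period, region, product, amount)
sales = [
    ("2010-01", "EMEA", "widget", 120.0),
    ("2010-01", "AMER", "widget", 340.0),
    ("2010-02", "EMEA", "gadget", 85.0),
    ("2010-02", "EMEA", "widget", 60.0),
]

# Pre-aggregate along the (region, product) dimensions -- the kind of
# one-off restructuring a dimensional store performs so that analytical
# queries don't have to re-scan the raw transactions every time.
cube = defaultdict(float)
for period, region, product, amount in sales:
    cube[(region, product)] += amount

# An "analytical query" is now a direct lookup:
print(cube[("EMEA", "widget")])  # 180.0
```

The trade-off mirrors the one in the article: the aggregation work is paid once, at load time, which is exactly why the same engine would be a poor choice for handling a stream of incoming updates.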
My main prediction for the future is that we’ll see two changes: an increasing use of multidimensional engines as more organizations become interested in advanced analytics, and implementations of relational engines running on massively parallel processing (MPP) so fast that they can handle both transactional and analytical workloads. At least one relatively niche (but very successful) vendor has been partially doing this for years, and recent acquisitions suggest that other mainstream relational database vendors are going to follow suit. If I’m right about this, it will represent the greatest shake-up of database management since the relational engine was developed.
Fast company: NoSQL databases put a premium on performance
But we live in interesting times, and there are other, less cataclysmic but still very important trends that will, I believe, come into play over the next few years. Indeed, we’re already seeing the emergence of new database technologies such as NoSQL, a broad church that itself embraces several sub-technologies. In general terms, they address concerns about the limitations of relational implementations: poor performance on jobs such as indexing, streaming media and serving Web pages on high-traffic sites. In such areas, NoSQL databases can be blisteringly fast when compared to traditional relational engines.
However, as we’ve already said, all design is a compromise, and NoSQL offerings are no exception. Usually, we find that the speed is phenomenal but at the expense of some other aspect of data management. A good example is that most relational databases offer ACID guarantees (i.e., transaction atomicity, consistency, isolation and durability). Some NoSQL database systems simply don’t provide such guarantees, and others offer what’s termed “weak” or “eventual” consistency.
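To make the atomicity guarantee concrete, here is a minimal sketch using SQLite (which ships with Python). The table and account names are hypothetical; the behavior shown is standard ACID transaction semantics: when any step of a transfer fails, the whole transaction rolls back and the debit never takes effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

try:
    # The connection context manager commits on success and rolls back
    # on an exception -- the "A" (atomicity) in ACID.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        cur = conn.execute(
            "UPDATE accounts SET balance = balance + 30 WHERE name = 'zed'")
        if cur.rowcount == 0:  # no such account: abort the whole transfer
            raise ValueError("transfer target not found")
except ValueError:
    pass

# Because the transaction rolled back, alice's debit never happened:
print(conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0])  # 100.0
```

An eventually consistent store makes the opposite bargain: a write is acknowledged quickly and other readers may briefly see stale data, which is acceptable for serving Web pages but not for moving money.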
NoSQL databases aren’t necessarily presented by their supporters as replacements for relational software, and in my opinion they aren’t. Most businesses will continue to need the mix of strengths that relational databases offer for transaction processing. On the other hand, many current relational implementations are a poor fit where data consistency isn’t an issue (for example, in read-only data stores) or where high-speed querying, indexing, high-volume data serving and so on are required. There is certainly a place for technologies that address those needs, and as such the NoSQL fraternity is to be welcomed.
So far, this discussion has focused mainly on so-called “structured data.” As it happens, I don’t like that term – the only truly unstructured data is random noise. However, there is plenty of semi-structured data out there that is becoming available for analysis: emails, tweets, blogs, etc. Such information doesn’t lend itself well to relational storage or to analysis, which in part is why we are seeing the emergence of technologies like MapReduce (a programming model for processing large data sets) and Hadoop (a framework for distributed computing and data storage). The two can be used together to find patterns in large amounts of information, including semi-structured data. MapReduce and Hadoop will never replace relational or multidimensional structures, but when used in conjunction with more traditional analytical techniques, they offer users the ability to analyze the two kinds of data together.
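The MapReduce model mentioned above fits in a few lines: a map step emits (key, value) pairs, a shuffle groups them by key, and a reduce step folds each group. The classic word-count example below is a single-process sketch; Hadoop’s contribution is running the same map, shuffle and reduce steps distributed across a cluster, where the document collection is far too large for one machine.

```python
from collections import defaultdict

def map_step(document):
    """Mapper: emit (word, 1) for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_step(word, counts):
    """Reducer: fold all the values grouped under one key."""
    return (word, sum(counts))

# A toy "corpus"; in practice this would be a mass of semi-structured
# text such as the emails, tweets and blogs mentioned above.
documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group the mappers' output by key.
groups = defaultdict(list)
for doc in documents:
    for word, count in map_step(doc):
        groups[word].append(count)

result = dict(reduce_step(word, counts) for word, counts in groups.items())
print(result["the"])  # 3
print(result["fox"])  # 2
```

Because the map calls are independent of one another, and each reduce call touches only one key’s group, both steps parallelize naturally, which is what lets the model scale out to the data volumes the article describes.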
So going forward, we may well see a broadening of the types of database engine in mainstream use. We’ve become very familiar with the relational model and tend to reach for relational software as the default tool for almost everything, at least in the transactional realm. But that’s questionable from an engineering standpoint, given the compromises involved in developing any database technology. In the coming years, we may become much better at choosing the appropriate engine for the job.
About the author:
Dr. Mark Whitehorn specializes in the areas of data analysis, data modeling and business intelligence (BI). Based in the U.K., he works as a consultant for a number of national and international companies and is a mentor with Solid Quality Mentors. In addition, he is a well-recognized commentator on the computer world, publishing articles, white papers and books. He is also a senior lecturer in the School of Computing at the University of Dundee, where he teaches the masters course in BI. His academic interests include the application of BI to scientific research.