Hadoop, Spark and other new technologies for handling the rapidly growing volumes of loosely structured data that companies collect are a great help in improving the effectiveness of business intelligence (BI) and analytics initiatives. However, there's one part of the big data argument that falls flat: proudly proclaiming the benefits of using a NoSQL approach to database design. That claim comes from some NoSQL proponents not really understanding key technology and business concepts -- or the real nature of the SQL vs. NoSQL debate.
Why the confusion? People, even technical ones, get used to certain things in their environment. It's like the old saying about fish not thinking much about water. The problem in the case of databases is that two different things grew up in concert and people often forget that they're different. I'm talking about the relational database management system (RDBMS) and the Structured Query Language (SQL).
The RDBMS is a mechanism for storing and managing information. As the D implies, it's a database, a set of information that can be manipulated. While it's the dominant database technology, the RDBMS is just one type of database structure -- many others also exist. For fellow old folks, remember IMS? Still sold by IBM, it's a hierarchical database. Today, we also have columnar databases, object databases, graph databases, document databases and more -- some of them grouped together under the NoSQL banner.
SQL, on the other hand, is a way of accessing database information through a combination of syntax and semantics. As the last word in its name says, it's a language. There's a big difference between that and an RDBMS, and that's why -- amid all the SQL vs. NoSQL uproar -- the newer database structures and SQL are already learning how to coexist.
Everybody knows SQL -- or so it seems
The fact that relational databases and SQL grew up together and have been a successful duo for more than 35 years means that SQL is known far and wide by people needing to access and query information in databases. It's the first language that comes to mind for IT staff, business analysts and even the business power users who function as departmental IT gurus. More to the point, SQL has reached beyond relational software for decades.
By the mid-1980s, a host of databases meant a host of access methods, and people began looking for a solution. Version 1.0 of the Open Database Connectivity (ODBC) interface was released in 1992; initially, it was designed as an API for translating between standard SQL and each relational database vendor's customized implementation of SQL. However, it quickly spread past RDBMS access to allow users to also get to flat files and other data sources. While ODBC is no longer as widely needed as it once was, the concept of extending SQL remains.
And vendors offering new types of databases that can help users get more business value from BI and analytics applications should understand that they need to leverage SQL if they want to gain broader market adoption. Numerous other vendors are already adding extensions to SQL to enable it to "talk" with NoSQL systems, as well as file-based Hadoop systems.
As Mark Milani, senior VP of product engineering at analytical database vendor Actian, put it to me, SQL's potential benefits to big data users are "too great to overlook." In the case of SQL-on-Hadoop query engines, for example, he noted that organizations using them can rely on existing SQL-trained workers instead of going out and trying to find "elusive and expensive" Hadoop experts.
SQL supports ecumenical data outreach
On their website, the folks working on the Apache Spark open source project describe Spark SQL -- an implementation of SQL for the big data processing engine -- as a "module for working with structured data." That's close, but they don't quite get it. Rather, the module should be looked at as a way for Spark coders to make SQL calls to whatever data structures are needed, whether relational, flat file, columnar or any other data source with a SQL interface. SQL is what will bring that heterogeneous mix of information to Spark for processing.
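To make the idea concrete, here is a minimal sketch of one SQL query spanning two differently shaped sources: a flat file and a set of JSON documents. It is not Spark SQL; it uses Python's built-in sqlite3 module as a stand-in query engine, and the sample data, table names and function names are all hypothetical. A real SQL-on-Hadoop or Spark setup would typically query the sources in place rather than copy them into tables first.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data: a flat file (CSV) and a set of JSON documents.
ORDERS_CSV = "order_id,customer_id,amount\n1,c1,120.50\n2,c2,75.00\n3,c1,30.25\n"
CUSTOMER_DOCS = '[{"id": "c1", "name": "Acme"}, {"id": "c2", "name": "Globex"}]'


def load_sources(conn):
    """Expose both sources to the SQL engine as tables (assumption: copying
    the data in is acceptable for a demo; real engines often read in place)."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, customer_id TEXT, amount REAL)"
    )
    order_rows = csv.DictReader(io.StringIO(ORDERS_CSV))
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :customer_id, :amount)", order_rows
    )

    conn.execute("CREATE TABLE customers (id TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (:id, :name)", json.loads(CUSTOMER_DOCS)
    )


def total_by_customer(conn):
    """One ordinary SQL statement joins the flat-file data to the document data."""
    query = """
        SELECT c.name, SUM(o.amount)
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
load_sources(conn)
result = total_by_customer(conn)
print(result)
```

The analyst writing the query needs to know only SQL; where each source came from and how it was originally stored stays hidden behind the interface.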
The English we speak today is not the English of Chaucer -- but it's still English. Languages extend themselves and evolve in the process. Telling overburdened IT organizations that they have to learn completely new programming languages or methodologies to flourish in the new big data world isn't going to help database vendors win business from competitors that provide proven functionality at a reasonable cost. Total cost of ownership matters to IT managers and business executives alike, and using SQL lowers TCO.
It's time for the NoSQL-only movement to fade away. While there are good reasons to use non-relational databases, and some separate tools will remain to deal directly with them, SQL will be the most important means of accessing the information stored in the specialized databases and Hadoop clusters that are the centerpieces of big data environments. In the end, the SQL vs. NoSQL debate is really no debate at all.
About the author:
David A. Teich is principal consultant at Teich Communications, a technology consulting and marketing services company. Email him at email@example.com.