With all the NoSQL, NewSQL, Hadoop and SQL-on-Hadoop databases, along with the columnar relational database management system (RDBMS) extensions, it is an exciting time for data architecture – maybe too much so. Long-time database management system (DBMS) industry observer Curt Monash helps sort through the maze of data planning today. This is part two in a series. Read part one, on data parallelism.
Sometimes I wonder what guide the past can provide. When they were new, it took a while for the first RDBMSs to settle into three or four well-known examples, and for a fairly common way to work with SQL to emerge.
Monash: Well, back in the 1980s we had several different kinds of DBMSs. Before the advent of the relational DBMS, every DBMS product had its own data manipulation language. Within a few years, SQL won. At that point, the new DBMSs had similar languages and similar programming models.
And there probably were five to 10 that mattered – most of which have survived in some form or another to this day – such as Oracle, DB2, Microsoft SQL Server (itself a fork of Sybase's main DBMS), Progress OpenEdge, Teradata and Sybase IQ. And that's where things were for a while.
What created the opportunity for new RDBMSs were cost shifts in various aspects of the technology and, in connection with that, the growing importance of scale-out clusters.
In a way, the Oracle DBMS today has taken on the role that the IBM mainframe operating system had a few decades ago. It is used in a large fraction of large enterprises. It is reliable. It is old. It is cumbersome. It is expensive. Oracle has high market share, so it provides a price umbrella for others.
A lot of enterprises have invested in a lot of people around that DBMS, and not everybody involved thinks high administrative costs are a bad thing. People who are well paid to hold jobs that are part of those cost structures think it is pretty good. All the things that used to apply to the IBM mainframe system apply to Oracle today.
Now, the new players in NoSQL, Hadoop and NewSQL tend to excel at certain types of workloads and to be inappropriate for others. But the better combinations of best-of-breed products and small-but-strong companies are likely to prosper for quite a while.
Yet the technology manager is confronted with so many options. One team member might promote Spark, another Impala, and yet another a columnar extension for PostgreSQL, and so on. How do they logically proceed?
Monash: It is very hard to answer that without having a particular workload in mind. You can't ask someone to choose among an 18-wheel truck, a sports car and a bicycle without knowing the workload.
It really depends on a lot of things. There are some workloads that tightly integrate SQL with very non-SQL processing -- workloads that go even beyond the analytical extensions to SQL RDBMSs. For some workloads, SQL-on-Hadoop is overwhelmingly the way to go. For that matter, there are companies that are adamantly opposed to paying for software at all. They like open source alternatives, and many of the open source scale-out SQL alternatives will work on Hadoop.
There are [many] considerations. One of them is the absolute size of the database. Another is the absolute magnitude of throughput that you need. And there is concurrency -- how many queries are likely to be hitting the database at the same time, or how many people are actually going to be using the system at the same time.
Another consideration is what kind of work you are doing besides SQL queries. Yet another is the complexity of the SQL queries: are there likely to be 14-way joins across lots of tables, or are they a lot simpler than that? That is an incomplete list, but you can probably start the discussion with those.
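To make the query-complexity point concrete, here is a minimal sketch in Python using SQLite. The 14-way joins Monash mentions are the extreme case; a scaled-down three-way star-schema join stands in for them here. All table and column names are hypothetical, invented for illustration only.

```python
import sqlite3

# Hypothetical star schema: a fact table (sales) plus two dimension tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE products  (id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sales     (customer_id INTEGER, product_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
INSERT INTO products  VALUES (10, 'widgets'), (11, 'gadgets');
INSERT INTO sales     VALUES (1, 10, 99.0), (2, 11, 25.0), (1, 11, 12.5);
""")

# A simple workload touches one table; an analytical workload joins many.
# An optimizer that handles this 3-way join easily may still struggle
# when the join count climbs toward 14 tables.
rows = cur.execute("""
SELECT c.region, p.category, SUM(s.amount)
FROM sales s
JOIN customers c ON c.id = s.customer_id
JOIN products  p ON p.id = s.product_id
GROUP BY c.region, p.category
ORDER BY c.region, p.category
""").fetchall()
print(rows)
conn.close()
```

Each additional table multiplies the join orderings the query planner must consider, which is one reason query complexity belongs on the evaluation checklist alongside size, throughput and concurrency.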