Newly emerging data management and database technologies like analytic and NoSQL databases, Hadoop and MapReduce are gradually becoming mainstream alternatives to traditional relational databases -- especially for enterprises with highly data-intensive computing requirements, according to experts.
But with all of the vendor hype that inevitably accompanies game-changing software, enterprises can have a tough time figuring out which new database technologies -- if any -- are right for them.
To help out on this front, SearchDataManagement.com got on the phone with data management technology expert David Menninger, a vice president and research director with Ventana Research Inc. in Belmont, Calif. Menninger talked about the definitions, benefits and drawbacks of analytic databases, Hadoop and MapReduce, and he explained why NoSQL databases should probably be called “Not-only-SQL databases.”
Why do we keep hearing so much about analytic databases these days?
David Menninger: Analytic databases are appealing because they are SQL-based, so they build on the knowledge that you currently have about using Oracle or IBM or Teradata or whatever. You can apply that same SQL-based knowledge to these analytic databases. In the analytic database category, I would include columnar databases and MPP [massively parallel processing] databases; [however, most analytic databases] include both MPP and some form of columnar technology. [Vendors in this market include EMC-Greenplum], Aster Data, Vertica, Infobright, ParAccel, [IBM-Netezza] and then there are some other newer, smaller ones.
What are some of the other benefits of analytic databases?
Menninger: [Analytic database vendors] have found ways to accelerate database processing -- to be able to handle larger volumes of data and process that data quickly and do it in a SQL-based environment, so those are good things. Generally, I think all of them are doing it on commodity-based hardware, with the exception of Netezza, really. Netezza is doing it in some respects on [IBM] commodity hardware, but it is commodity hardware configured into a proprietary box that they assemble and sell with some proprietary components thrown in. The appeal of being able to use inexpensive hardware to solve these problems fundamentally results in lower cost of ownership. People want performance, they get performance and they can do it in a way where they don’t have to pay an arm and a leg, so that’s appealing to people.
What is the downside of analytic databases?
Menninger: The downside is that, to varying degrees, there are issues of, I’ll say, SQL completeness. None of them are going to be as complete as, for instance, Oracle or DB2. Also, Oracle has PL/SQL and these other products don’t necessarily support PL/SQL. I understand Netezza uses some of EnterpriseDB’s technology to provide PL/SQL compatibility. But you know it’s never going to be 100%. So, the downside is that if you’re moving from an Oracle environment, there could be some compatibility issues with some of the Oracle-specific SQL you’ve written.
Do you have any other warnings about analytic databases?
Menninger: The ecosystem around these third-party products is not necessarily quite as big as the ecosystem around the traditional products. Since they’re all SQL-based, many tools do work with these products, but again, it’s not quite as big as the ecosystem around the other products.
What is Hadoop?
Menninger: Functionally, what Hadoop provides is a way to store and analyze large amounts of data on many processors or many machines. Fundamentally, it has two components. The first is a distributed file system, so it has a way to take a set of data and split it up across different machines and to do it in a way that provides redundancy. I think you get three copies of every bit of data across three different nodes in the system. And so, if one node goes down, you still have access to that data because it’s on two other nodes as well. That is referred to as HDFS, the Hadoop Distributed File System. There are lots of other pieces [to Hadoop] -- 11 pieces altogether. But the two major pieces are HDFS [and MapReduce], which is for analyzing the data. If you spread the data across 10 machines, some amount of analysis has to be brought together to one machine. [This is done with MapReduce].
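[Editor's illustration: The redundancy Menninger describes can be sketched as a toy model in Python. This is not the HDFS API. The node names, the function names and the simple round-robin placement are assumptions made for illustration; real HDFS uses its own placement policy, and the replica count is configurable.]

```python
# Toy model of block replication; illustrative only, not the HDFS API.
REPLICATION_FACTOR = 3  # the "three copies" mentioned in the interview


def place_replicas(block_id, nodes, replication=REPLICATION_FACTOR):
    """Pick `replication` distinct nodes for one block (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]


def readable_after_failure(placement, failed_node):
    """A block stays readable if at least one replica sits on a surviving node."""
    return any(node != failed_node for node in placement)


# Hypothetical five-node cluster holding ten blocks.
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
placements = {block: place_replicas(block, nodes) for block in range(10)}

# Losing any single node leaves every block readable from another copy.
all_readable = all(
    readable_after_failure(placement, "node-b")
    for placement in placements.values()
)
```

[Because every block lands on three distinct nodes, the loss of one node never removes the last copy of any block, which is the point Menninger makes above.]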
What are the chief advantages of Hadoop and MapReduce?
Menninger: There are a couple of advantages. First of all it’s open source, so that implies free, although it’s not entirely free, because you might want to pay for support. But fundamentally, it’s a lower-cost alternative. There is no database license per se, and it can handle very large amounts of data because you can take 10, 50, 100 machines to do the processing. The infrastructure around it handles the parallel processing. You write these relatively simple routines for the mapping and the reduction, and the infrastructure takes responsibility for scheduling the jobs on each of the 100 machines and making sure that all 100 complete successfully. If one fails, it will redistribute that work to the other machines. So, the advantages of Hadoop are potentially even lower costs than the analytical databases and potentially even more scalability than the analytic databases.
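[Editor's illustration: The "relatively simple routines for the mapping and the reduction" can be sketched with a word count in plain Python. This is a single-process stand-in, not Hadoop code. The function names and sample data are assumptions for illustration, and the scheduling and failure handling that Hadoop's infrastructure provides are deliberately left out.]

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def run_job(partitions, mapper, reducer):
    """Stand-in for the framework: run mappers, group by key, run reducers."""
    grouped = defaultdict(list)
    for partition in partitions:  # each partition would live on its own machine
        for line in partition:
            for key, value in mapper(line):
                grouped[key].append(value)  # gather values by key
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical input, already split into two partitions.
partitions = [
    ["big data big clusters"],
    ["big jobs small routines"],
]
result = run_job(partitions, map_words, reduce_counts)
print(result["big"])  # prints 3
```

[The developer writes only map_words and reduce_counts. In a real cluster, the work done here by run_job is what the infrastructure takes over, including rerunning tasks from a failed machine elsewhere.]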
What are the big drawbacks of Hadoop and MapReduce?
Menninger: Now we get into a much bigger set of risks or downsides. It’s not a SQL environment, so it’s not [something that can be leveraged] with the skill sets that you have. But you can leverage a different set of skills. For instance, you can write the MapReduce jobs in a whole bunch of different languages, so there is probably a language that you have skills in within your organization, but it’s not SQL-based. It fundamentally takes you away from this database skill set that you have in your organization. Because it’s not SQL-based, the variety of tools that work with it is also much smaller, and [you’ll have] to start thinking about the tools you’re going to use.
There is some level of SQL interaction with Hadoop. One of the 11 components I glossed over is in fact a limited set of SQL to interact with the data. And so vendors are taking this limited set of SQL capabilities and extending it so that they can make their tools work with it or having their tools operate with this limited set of SQL. But the whole ecosystem is much smaller than the ecosystem of the analytic database vendors. However, if you have a large amount of data, that may be worth it.
When should a company choose a NoSQL database?
Menninger: In the NoSQL category, it’s just about performance. The overriding concern and factor is performance. While “NoSQL” perhaps originally might have meant no SQL involved, it’s actually Not-only-SQL. There can be some amount of SQL capabilities. The thing about NoSQL is that it’s fundamentally for dealing with very high-performance situations. Some of those might be real-time scenarios in the online gaming world or real-time scenarios in the stock trading world [where there are] very large amounts of data where maybe you store key-value pairs as opposed to storing a less structured file. But that’s what the NoSQL environment is all about. They are often open source. In fact, I’m not aware of any proprietary NoSQL products. I think it is part of the culture and ethos that if you’re going to go down the NoSQL path, they tend to be open source products.