Managing Hadoop projects: What you need to know to succeed
The data management status quo has been disrupted by the emergence of Hadoop and a flood of other open source technologies tied to the distributed processing framework -- and there are many ramifications for IT teams tasked with deploying and managing Hadoop clusters to support big data applications. That theme arose from a conversation between SearchDataManagement and Forrester Research Inc. analyst Mike Gualtieri. On tap for discussion during the interview were Hadoop data lakes; SQL-on-Hadoop tools; and that dazzling Hadoop ecosystem newcomer, Spark.
Can you give us a sense as to how the data lakes now being formed relate to conventional enterprise data warehouses?
Mike Gualtieri: The Hadoop data lakes that are being built now are being built beside data warehouses, and they're largely being built with new data that is combined with existing data. There are lots of new data sources. Every time you push a button on a remote control, that gets stored by the cable company. It would be awfully expensive to store all those pushes in a data warehouse. That's new data -- Hadoop is a great spot for that. It's being used when you want to store something but the price tag is too high on a data warehouse.
But these data lakes are starting to evolve, and they're starting to look like mini data warehouses. I know it has been popular to say the data lake is not going to replace the enterprise data warehouse, but I think at some point it's going to absolutely be a data warehouse alternative at many companies. A couple of things are going to have to happen, though.
One is that Hadoop is going to have to get more efficient. Even though we're getting SQL-on-Hadoop [technology] that is more compliant [with SQL], there are still inherent technical limitations in Hadoop's ability to do I/O, and that is often a limiting factor for performance. There, the data warehouse appliance has an advantage. The second thing is there are going to have to be better data governance tools that allow you to model and organize data in Hadoop. Right now, a lot of these Hadoop data lakes are getting messy. Some call them swamps.
How confident can we be that SQL-on-Hadoop will close the performance gap?
Gualtieri: Hadoop is getting faster and faster with each release of various engines. Vendors like Actian have taken their existing technology to run SQL in Hadoop. You also have IBM, Oracle, Microsoft. You have some very sophisticated SQL [interfaces] that over the last year or two have been slowly ported to Hadoop. That's giving you orders-of-magnitude better performance than you had originally with Hive.
It's still not at the point where the data warehouse is, because the data warehouse combines optimized hardware and software. But I think there is enough performance for many, many applications. Is there enough performance for large numbers of users doing concurrent interactive queries? Well, that is where you're going to need more I/O. Say you have 100 business intelligence users running SQL queries concurrently against Hadoop -- it's not optimized for that. But if you have a few data scientists and a few data analysts, they're going to get lots of benefits out of [querying Hadoop data].
Hadoop may have made a bridgehead for the Spark analytical processing engine, given that so many Spark applications today run on the Hadoop Distributed File System [HDFS]. But you've said Spark may come to stand alone.
Gualtieri: Yes, Spark is a data processing platform just like Hadoop is. It's a cluster computing environment just like Hadoop is. A lot of people associate Spark with Hadoop, but Spark is perfectly fine running in its own cluster. The question is: Is Spark an alternative to Hadoop or do they belong together? In most of the implementations you're now hearing about, Spark and Hadoop are together. Spark doesn't have its own file system, and if you want to run a Spark job, well, it's kind of convenient to run it against HDFS, because that's where the data is today.
You often hear people talk about the in-memory [processing] aspect of Spark. And that has been important for it. Because of in-memory, it's much faster. But the thing you hear less about with Spark, and which also makes it popular, is the programming model. Compared with the way you write a Hadoop MapReduce job, writing a Spark processing job is much less complicated.
As a result, you even hear some Hadoop vendors say that we'll see MapReduce slowly go away -- that when someone is going to write a data processing job, they're going to write Spark on top of Hadoop. Moreover, Spark can be programmed in Scala, Python and Java, and [its developers are] also working on R -- while, for the most part, Hadoop is programmed in Java. So, more programmers can write a job in Spark more easily, and it's going to run faster -- if you have enough memory.
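To make the programming-model point concrete, here is a rough sketch in plain Python that counts words two ways: first as explicit map, shuffle and reduce phases, then as one chain of transformations. It uses neither Hadoop nor Spark. `ToyDataset` and its methods are hypothetical stand-ins written for this illustration; they only loosely imitate the chained style of Spark's collection API, and everything runs in a single process on a small in-memory list.

```python
from collections import defaultdict

# Made-up sample input for the illustration.
SAMPLE_LINES = [
    "spark runs on hadoop",
    "hadoop stores data",
    "spark is fast",
]


# --- Style 1: explicit map, shuffle and reduce phases ---
def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count_phases(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


# --- Style 2: chained transformations on a toy collection ---
class ToyDataset:
    """Hypothetical in-memory collection; not a real Spark class."""

    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, fn):
        # Apply fn to each item and flatten the resulting sequences.
        return ToyDataset(out for item in self.items for out in fn(item))

    def map(self, fn):
        return ToyDataset(fn(item) for item in self.items)

    def reduce_by_key(self, fn):
        # Merge the values of (key, value) pairs that share a key.
        merged = {}
        for key, value in self.items:
            merged[key] = fn(merged[key], value) if key in merged else value
        return ToyDataset(merged.items())

    def collect(self):
        return list(self.items)


def word_count_chained(lines):
    return dict(
        ToyDataset(lines)
        .flat_map(str.split)
        .map(lambda word: (word, 1))
        .reduce_by_key(lambda a, b: a + b)
        .collect()
    )


if __name__ == "__main__":
    print(word_count_phases(SAMPLE_LINES))
    print(word_count_chained(SAMPLE_LINES))
```

Both functions return the same counts. The difference is in how the job is expressed: the first style spells out each phase as a separate piece of code, while the second reads as a single pipeline. A real MapReduce or Spark job would also have to deal with distribution, serialization and fault tolerance, which this sketch leaves out entirely.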