Essential Guide

Managing Hadoop projects: What you need to know to succeed

A comprehensive collection of articles, videos and more, hand-picked by our editors

Hadoop data lakes must get more efficient, less 'messy' to oust EDWs

In a Q&A, Forrester analyst Mike Gualtieri said Hadoop-based data lakes can become an alternative to enterprise data warehouses. But first, faster I/O and better data governance are needed.

The data management status quo has been disrupted by the emergence of Hadoop and a flood of other open source technologies tied to the distributed processing framework -- and there are many ramifications for IT teams tasked with deploying and managing Hadoop clusters to support big data applications. That theme arose from a conversation between SearchDataManagement and Forrester Research Inc. analyst Mike Gualtieri. On tap for discussion during the interview were Hadoop data lakes; SQL-on-Hadoop tools; and that dazzling Hadoop ecosystem newcomer, Spark.

Can you give us a sense as to how the data lakes now being formed relate to conventional enterprise data warehouses?

Mike Gualtieri: The Hadoop data lakes that are being built now are being built beside data warehouses, and they're largely being built with new data that is combined with existing data. There are lots of new data sources. Every time you push a button on a remote control, that gets stored by the cable company. It would be awfully expensive to store all those pushes in a data warehouse. That's new data -- Hadoop is a great spot for that. It's being used when you want to store something but the price tag is too high on a data warehouse.


But these data lakes are starting to evolve, and they're starting to look like mini data warehouses. I know it has been popular to say the data lake is not going to replace the enterprise data warehouse, but I think at some point it's going to absolutely be a data warehouse alternative at many companies. A couple of things are going to have to happen, though.

One is that Hadoop is going to have to get more efficient. Even though we're getting SQL-on-Hadoop [technology] that is more compliant [with SQL], there are still inherent technical limitations in Hadoop's ability to do I/O, and that is often a limiting factor for performance. There, the data warehouse appliance has an advantage. The second thing is there are going to have to be better data governance tools that allow you to model and organize data in Hadoop. Right now, a lot of these Hadoop data lakes are getting messy. Some call them swamps.

How confident can we be that SQL-on-Hadoop will close the performance gap?

Gualtieri: Hadoop is getting faster and faster with each release of various engines. Vendors like Actian have adapted their existing technology to run SQL in Hadoop. You also have IBM, Oracle, Microsoft. You have some very sophisticated SQL [interfaces] that over the last year or two have been slowly ported to Hadoop. That's giving you orders-of-magnitude better performance than you had originally with Hive.

It's still not to the point the data warehouse is, because the data warehouse is optimized hardware and software. But I think there is enough performance for many, many applications. Is there enough performance for large numbers of users doing concurrent interactive queries? Well, that is where you're going to need more I/Os. Say you have 100 business intelligence users running SQL queries concurrently against Hadoop -- it's not optimized for that. But if you have a few data scientists and a few data analysts, they're going to get lots of benefits out of [querying Hadoop data].

Hadoop may have provided a bridgehead for the Spark analytical processing engine, given that so many Spark applications today run on the Hadoop Distributed File System [HDFS]. But you've said Spark may come to stand alone.

Gualtieri: Yes, Spark is a data processing platform just like Hadoop is. It's a cluster computing environment just like Hadoop is. A lot of people associate Spark with Hadoop, but Spark is perfectly fine running in its own cluster. The question is: Is Spark an alternative to Hadoop or do they belong together? In most of the implementations you're now hearing about, Spark and Hadoop are together. Spark doesn't have its own file system, and if you want to run a Spark job, well, it's kind of convenient to run it against HDFS, because that's where the data is today.

You often hear people talk about the in-memory [processing] aspect of Spark. And that has been important for it. Because of in-memory, it's much faster. But the thing you hear less about Spark, but which also makes it popular, is the programming model. Between the way you write a Spark processing job versus the way you write a Hadoop-MapReduce job -- it's much less complicated with Spark.

As a result, you even hear some Hadoop vendors say that we'll see MapReduce slowly go away -- that when someone is going to write a data processing job, they're going to write Spark on top of Hadoop. Moreover, Spark can be programmed in Scala, Python and Java, and they're also working on R -- while, for the most part, Hadoop is programmed in Java. So, more programmers can write a job in Spark more easily, and it's going to run faster -- if you have enough memory.
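Gualtieri's point about the programming models can be sketched in plain Python, with no cluster required. The `MiniRDD` class below merely mimics the names of Spark's RDD API (`flatMap`, `map`, `reduceByKey`, `collect`) for illustration; real MapReduce jobs are written against Hadoop's Java API, and real Spark jobs against Spark itself, so treat this as a didactic sketch rather than either framework's actual code.

```python
from collections import defaultdict

lines = ["big data on hadoop", "spark on hadoop", "spark standalone"]

# --- MapReduce style: logic split into explicit map, shuffle, reduce phases ---
def mapper(line):
    # Emit a (word, 1) pair for each word in the line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all emitted values by key, as the framework's shuffle phase would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Sum the counts for one word.
    return key, sum(values)

mapped = [pair for line in lines for pair in mapper(line)]
mr_counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())

# --- Spark style: one chained pipeline of functional transformations ---
class MiniRDD:
    """Toy stand-in that imitates the shape of Spark's RDD API."""
    def __init__(self, data):
        self.data = list(data)
    def flatMap(self, f):
        return MiniRDD(x for item in self.data for x in f(item))
    def map(self, f):
        return MiniRDD(f(x) for x in self.data)
    def reduceByKey(self, f):
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return MiniRDD(acc.items())
    def collect(self):
        return list(self.data)

spark_counts = dict(
    MiniRDD(lines)
    .flatMap(str.split)
    .map(lambda w: (w, 1))
    .reduceByKey(lambda a, b: a + b)
    .collect()
)

# Both styles compute identical word counts.
assert mr_counts == spark_counts
```

The chained pipeline expresses the whole word count in a single readable expression, while the MapReduce style forces the same logic into separate map, shuffle and reduce phases -- which is the complexity gap Gualtieri describes.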

Jack Vaughan is SearchDataManagement's news and site editor. Follow us on Twitter: @sDataManagement.

Next Steps

Learn how a data lake can become a primary info repository

Find out why a data lake is not a place for just lounging around

Watch a video with discussion and analysis of the data lake concept

This was last published in February 2015





Join the conversation



How do you use Hadoop-based data lakes?
The Hadoop Distributed File System is designed to support large clusters of commodity hardware, with the ability to store massive amounts of data at a reasonable cost. HDFS is also schema-less, enabling it to support any file format, including unstructured and semi-structured data. Its ingestion tier has multiple capabilities that allow data to be taken in either in bulk or at high velocity. And since it can store massive amounts of data, it can be customized to support complex event processing.
I'm still not sure that SQL-on-Hadoop is the answer. I'm not even sure SQL-in-Hadoop is the answer. I worked with a very large company in the nineties that had a staff of 1,000 analysts. Today we might even call them data scientists. They all had quantitative master's or Ph.D. degrees. When they were hired, they were trained in two things -- SAS and SQL. All of them learned SAS; less than half ever became proficient in SQL.
Two points from me. One, what nraden is saying is true. In fact, the company he or she worked for could be my company (former Bell Labs). More importantly, though, Hadoop has still not found its way into data models. Maybe I am old-fashioned, but I think a data model that holds the "single version of the truth" is still required for consistent reporting and analysis. Hadoop and the data lake can be used for near-real-time, operational reporting, though.
Despite Hadoop's inherent technical limitations on I/O, even with SQL-on-Hadoop, better data governance tools can still make it possible to model and organize the information.