
When building Hadoop data lakes, don't plunge in without a plan

A data lake may seem like a clear choice -- deposit all your data in a Hadoop repository for analytics uses -- but navigating those IT waters isn't so simple, consultant Wayne Eckerson says.

As Hadoop data lakes enter mainstream IT consciousness, more companies are starting to dip their toes in the water. At this point, though, many are developing data ponds for only a portion of their data rather than full-fledged lakes, according to Wayne Eckerson, principal consultant at Eckerson Group, a research and consulting company that focuses on business intelligence and analytics initiatives.

Data lakes use clustered systems based on the open source Hadoop framework and commodity hardware to process, store and manage pools of big data, typically to support analytics applications. Proponents say a data lake architecture offers a significantly less expensive alternative to traditional data warehouses, along with the ability to handle a mix of structured, unstructured and semi-structured information. But the data lake concept is relatively new and still murky, with hidden complications and challenges that may be overlooked as companies decide to take the plunge. Hadoop and related technologies "have come a long way in the last seven years, but there's still a way to go, for sure," Eckerson said in an interview with SearchDataManagement.

In this Q&A drawn from the interview, Eckerson describes the potential benefits of building a data lake, plus the hurdles and misconceptions that can hinder the process. He also offers advice on answering the fundamental question: Is a Hadoop data lake right for your organization?

How prevalent are Hadoop data lakes among the companies that you work with?


Wayne Eckerson: A lot of companies are experimenting with data lakes [or] exploring the idea. A data lake is really when you put all of your data into [a Hadoop cluster]. Companies are certainly thinking of doing that, especially if they don't have a data warehouse -- they may want to start with Hadoop instead of a relational database if they have the skills on board to do it. But I think data lakes are something vendors are pushing more than users are adopting. It's more a vendor thing than reality.

Why should organizations consider building a data lake? What are the big benefits that data lakes can provide?

Eckerson: There are benefits in theory. You put all your data in this lake, and you don't have to move it. Today, what happens is that you have to move your data to the right processing system that supports the task you're trying to do. The data lake promises that you leave the data where it is and bring the processing to it -- the less you touch the data, the better. The question is, we spent the last 20 years figuring out how to process data in a reliable way that gives accurate answers, and no one has quite figured out exactly how to do that in the data lake yet. There's all this unsexy stuff that you need to do to data to make sure it's in the right form to process so you get the right answers. People are overlooking that a lot [with data lakes] because the costs are so low.

Do people have misconceptions about data lakes that can lead to problems when organizations try to deploy them?

Eckerson: With any new technology, there's a tendency to think it's the cure-all to everything that ails you -- the silver-bullet syndrome. [Building a data lake] isn't so simple today. Hadoop requires a high level of expertise, and there's been a lot of functionality missing -- like security, management, backup and recovery, interactive queries. All of this is being built very quickly into the platform, so it's becoming more and more robust and enterprise-ready. But there are still some things it doesn't support or where it isn't as reliable as your traditional data warehousing environment. You can't just give people access to the raw data and expect them to do anything with it. You need to have people go in there to build up different views of the data -- different constructs, little sandboxes for people to look at depending on their department or function. It's kind of the same process we went through with the data warehouse, just a different technology. I think the biggest misconception is that Hadoop is a ready-to-use environment for business use. It's not. It's still an area for specialists with certain skills.

Are data lakes for everyone? Or is data lake technology more suitable for particular kinds of organizations?

Eckerson: Any company can experiment and play with it if they want to, if they have time and resources. I would say that you have to be open to doing this. You have to have time to experiment with it and have a vision of what it can do for you. Companies that are early adopters of technology have already looked at [Hadoop] and deployed it in a big way. It's in the early mainstream now, but there are still a lot of companies that are struggling too much with the current [data warehousing] technology to even think about bringing it in. It's a culture thing: Some are more inclined to adopt new technologies than others.

Corlyn Voorhees is an editorial assistant for SearchDataManagement. Follow us on Twitter: @sDataManagement.

