

Swim fast with a Hadoop data lake architecture -- or sink

The Hadoop data lake concept presents plenty of challenges for organizations. But the experiences of early adopters point the way toward successful data lake architecture deployments.

There's no question that there are data lake skeptics. The term practically invites sarcastic variants -- data swamp, data puddle -- and visions of watery doom. And there are more substantive arguments against the validity of the Hadoop data lake architecture. 

Gartner is a prominent doubter -- the consulting and market research outfit stated its case in a July 2014 report punningly but sharply titled The Data Lake Fallacy: All Water and No Substance. The report pointed to data lake challenges such as culture changes, lack of skills and data governance issues. In an accompanying press release, Gartner analyst Andrew White said that "without at least some semblance of information governance, the lake will end up being a collection of disconnected data pools or information silos all in one place."

It's enough to give you a sinking feeling. But not everyone is down on the data lake. In an article published on companion site SearchBusinessAnalytics in Aug. 2014, consultant Wayne Eckerson wrote that analytics architectures of the future could plausibly revolve around Hadoop, which he described as "a scalable, flexible data processing platform that can meet most enterprise data analysis requirements." 

And indeed, there are organizations that are, yes, diving into the data lake by deploying Hadoop clusters as their lead platform for collecting raw data and then processing and analyzing it.


SearchDataManagement and SearchBusinessAnalytics have published a series of stories that highlight the experiences of some of those users. In one, we examine the issues that three companies faced in building and managing data lakes, and how they addressed those challenges. In another, we explore data lake deployments at insurer Allstate and managed services provider Solutionary. A third story looks at Hadoop's potential business benefits through the eyes of one IT exec and data lake manager.

We also have insight on the data lake concept from various consultants and industry analysts. In a Q&A, Eckerson details hurdles and misconceptions that can hinder data lake development. In other interviews, Mike Gualtieri of Forrester Research discusses what needs to happen to make data lakes more broadly feasible, and consultant Joe Caserta says not to ignore traditional IT principles when designing a Hadoop data lake architecture. And Andy Hayler of The Information Difference calls for more examples of viable big data use cases to help keep data lake adoption from bogging down.

Craig Stedman is executive editor of SearchDataManagement. Email him at cstedman@techtarget.com and follow us on Twitter: @sDataManagement.

Next Steps

A Hadoop data lake project isn't all rest and relaxation

What the data lake buzz is all about, plus a reality check

The data lake's key enabler: Hadoop 2 and its YARN resource manager


Join the conversation




What's your top tip for building a Hadoop data lake architecture?
History should be a guide for us here. After all, what led to the rise of the RDBMS over the last 40 years is not that developers could not implement better data storage and retrieval techniques. These "softer" drivers were central to its rise:
  • RDBMSs provided an information radiator through which developers could garner valuable information about a system's data, structure and relationships, aiding their own work. Development and operational overhead ("code archaeology") was greatly reduced.
  • Implementers could avoid much of the "plumbing" effort, and this core system component (the RDBMS) let implementers across the profession avoid solving the same problem over and over again with different, opaque approaches.
A "data lake" is not much different, in that the underlying drivers -- concise information about a system's data and structure, plus reuse -- remain today. However, Hadoop is a different beast: at its core, it is a distributed file system with an elementary "data operating system," which sent the profession back to square one. This is why SQL-on-Hadoop solutions were embraced, even in their early, error-prone, limited-capability implementations.
So what is the tip for building a Hadoop data lake architecture? Two things are required:
  • Provide an information radiator with metadata details. However, given that Hadoop scales out and can support many, potentially cross-departmental data sets, the metadata must not only describe the data; it must also describe the data's source, its business meaning, the processing that has been applied to it and when its underlying files arrived in the file system.
  • Provide a pluggable "plumbing" framework, along with a UI, that reduces the need for "code archaeology." Again, given Hadoop's nature, there is a dizzying set of tools in the toolkit. This is a big difference from the RDBMS world, where a familiar set of built-in capabilities grew slowly over the decades (e.g., cursors, aggregate functions, stored procedures). With Hadoop, a developer tasked with building a solution that aggregates transaction data might use Pig, Hive, MapReduce or Spark, with Java, Scala, Python, etc. This cornucopia of "plumbing" tools resurrects the risk that developers across the enterprise will solve the same problem over and over again in myriad ways.
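To make the first tip concrete, the information radiator can be pictured as a simple catalog record per data set -- one that captures source, business meaning, location and processing lineage. This is a minimal, hypothetical sketch in Python; the `DatasetMetadata` class and its field names are illustrative assumptions, not any particular metadata tool's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical "information radiator" entry for one data set in the lake.
# Field names are assumptions chosen to match the tip above, not a real product.
@dataclass
class DatasetMetadata:
    name: str                    # logical data set name
    source_system: str           # where the raw files originate
    business_meaning: str        # what the data represents to the business
    hdfs_path: str               # location in the distributed file system
    arrived_at: datetime = None  # when the underlying files landed
    lineage: list = field(default_factory=list)  # processing steps applied so far

    def record_processing(self, step: str) -> None:
        """Append a processing step so later readers can trace lineage."""
        self.lineage.append(step)

# Registering a data set and tracking a transformation against it
meta = DatasetMetadata(
    name="retail_transactions",
    source_system="pos_feed",
    business_meaning="Point-of-sale transactions, one row per line item",
    hdfs_path="/lake/raw/retail_transactions/2015/06",
    arrived_at=datetime(2015, 6, 1, tzinfo=timezone.utc),
)
meta.record_processing("deduplicated by Spark job dedupe_tx_v2")
print(meta.lineage)
```

However the record is stored, the point is that a developer encountering `retail_transactions` for the first time can learn where it came from, what it means and what has already been done to it -- without code archaeology.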
Good points, clukasik. To borrow a phrase from my colleague Jack Vaughan, the breadth of the Hadoop ecosystem appears to be both a blessing and a curse for user organizations. All the tools surrounding Hadoop bring a lot of functionality -- but they can amount to quite a jumble of technologies.
This is a nice post. We agree - we've been involved in many successful production implementations of Hadoop Data Lakes. I've talked with Gartner at length on this topic, and while they are cautious about the speed of adoption, they don't seem pessimistic about the data lake's value. They just want to see the data lake management capabilities mature. I believe managed ingestion is key - and metadata (business, technical and operational) should be created and actively curated. Otherwise, the Data Lake is a wasted opportunity.
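The managed ingestion idea could be sketched as a simple admission gate: a file lands in the lake only if a metadata entry with the required business, technical and operational fields exists first. This is a purely hypothetical Python example; the function name, required fields and paths are assumptions for illustration, not any vendor's API:

```python
# Hypothetical managed-ingestion gate. The required metadata fields below are
# illustrative assumptions; a real deployment would define its own catalog schema.
REQUIRED_FIELDS = {"name", "source_system", "business_meaning"}

def admit_to_lake(filename: str, catalog_entry: dict) -> bool:
    """Admit a file only when its catalog entry carries the required metadata."""
    missing = REQUIRED_FIELDS - catalog_entry.keys()
    if missing:
        print(f"rejected {filename}: missing metadata {sorted(missing)}")
        return False
    print(f"admitted {filename} -> /lake/raw/{catalog_entry['name']}/")
    return True

# A file with a bare-bones entry is turned away until the entry is curated.
admit_to_lake("tx_20150601.csv", {"name": "retail_transactions"})
admit_to_lake(
    "tx_20150601.csv",
    {"name": "retail_transactions",
     "source_system": "pos_feed",
     "business_meaning": "POS line items"},
)
```

Gating ingestion this way is what keeps the lake from becoming the "collection of disconnected data pools" Gartner warns about: nothing lands without a curated description.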
Thanks for your comment, KellyatZaloni. Creating a process for making the different streams of data flowing into a data lake consistent, for accurate analysis, was one of the big deployment issues cited by some experienced Hadoop cluster managers I quoted in one of our stories. It isn't all sunshine and easy sailing on the data lake!