Data lake meets warehouse in hybrid data architectures

A new view on hybrid data architectures, in which data lakes and warehouses coexist, emerged at EDW 2016. The hybrid approach has implications for data design, skills and planning.

As the Hadoop data lake gains more definition and deployments, it's beginning to look like something that will coexist with existing data warehouse technology. Such a view of hybrid data architectures emerged in sessions at the Enterprise Data World 2016 conference in San Diego, Calif.

"It's not an 'all or nothing' thing. It's a 'both' thing," consultant Joe Caserta told EDW 2016 attendees. "The enterprise data warehouse will not go away. Even when we are doing Hadoop and Spark and all the other shiny new things, it is still there."

But data lakes are finding a place in data science and big data analytics applications. Caserta, president and CEO of Caserta Concepts in New York, said Hadoop-based data lakes are typically built first of all to handle large and quickly arriving volumes of unstructured data. The data lake is a key part of big data trends that will bring change to data professionals' familiar practices, according to Caserta and others.

"What we used to do with data warehouses was first to create data models, but that has changed," Caserta said. With data lakes, the models come after the fact. "We don't do it right away anymore," he said.
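
The "model after the fact" idea is often called schema-on-read. The following is a minimal Python sketch of it, not anything shown at the conference: the record layout, the field names (`user`, `amount`) and the `read_with_model` helper are all hypothetical. Raw records are stored untouched, and a model is applied only when a reader needs one.

```python
import json

# Hypothetical raw records as they might land in a lake: nothing is validated on write.
raw_lake = [
    '{"user": "a1", "amount": "19.99", "ts": "2016-04-18T10:00:00"}',
    '{"user": "b2", "amount": "5.00", "channel": "web"}',
    'not even json',
]

def read_with_model(lines):
    """Apply a simple model at read time: keep records that have a user and an amount."""
    rows = []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable input stays in the lake; this reader just skips it
        if isinstance(rec, dict) and "user" in rec and "amount" in rec:
            rows.append({"user": rec["user"], "amount": float(rec["amount"])})
    return rows

payments = read_with_model(raw_lake)
```

A different reader could apply a different model to the same raw records, which is the flexibility the after-the-fact approach trades against up-front validation.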

Analytics and applications

One reason for that is the data lake's association with real-time data streaming. As analytics become more closely tied to operational applications, and part of real-time decision making, data is required to be accessible as soon as it's created, Caserta said. That, too, makes it very different from data warehouse work, which continues to be the foundation for necessary business reports.

This view was shared by Tom Place, director of data management at payment processing, retail data security and e-commerce services provider First Data Corp. He sees a distinction between uses for the data lake and data warehouse, as well as a need for both in data architectures.

"The data warehouse is really designed for slowly changing data -- daily summaries, weekly summaries and monthly summaries of known, structured data," Place said. "On the other hand, the data lake is being designed for quickly changing data -- data that tells you what happened one minute ago or five minutes ago."

Like Caserta, Place is seeing selective rollups of unstructured data from the lake going into the warehouse. 
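
As a rough illustration of such a rollup, the Python sketch below aggregates fast-arriving lake events into the kind of daily summary rows a warehouse would hold. It is an assumption-laden example rather than First Data's method: the event fields (`ts`, `merchant`, `amount`), the sample values and the `daily_rollup` function are invented for illustration, and the events are shown already parsed into fields for brevity.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical minute-level events sitting in the lake.
lake_events = [
    {"ts": "2016-04-18T10:00:05", "merchant": "m1", "amount": 20.0},
    {"ts": "2016-04-18T10:03:40", "merchant": "m1", "amount": 5.5},
    {"ts": "2016-04-18T23:59:59", "merchant": "m2", "amount": 12.0},
    {"ts": "2016-04-19T00:00:01", "merchant": "m1", "amount": 7.0},
]

def daily_rollup(events):
    """Summarize events into one row per (day, merchant) for loading into a warehouse."""
    totals = defaultdict(lambda: {"count": 0, "total": 0.0})
    for event in events:
        day = datetime.fromisoformat(event["ts"]).date().isoformat()
        bucket = totals[(day, event["merchant"])]
        bucket["count"] += 1
        bucket["total"] += event["amount"]
    return [
        {"day": day, "merchant": merchant, "count": v["count"], "total": v["total"]}
        for (day, merchant), v in sorted(totals.items())
    ]

warehouse_rows = daily_rollup(lake_events)
```

The detailed events remain in the lake for minute-by-minute questions, while only the slowly changing summaries move to the warehouse.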

Data reservoir days

As data lakes evolve, their days as a simple, undifferentiated refuge for data may be nearing an end. Caserta and Place both see different degrees of data governance being applied to different levels of data in the data lake.

The divisions are based on the purposes -- and skills -- of advanced analytics users. For Place, data consumers at Atlanta-based First Data include business analysts and data scientists, as well as specialists in product innovation and product refinement. Example applications range from business reporting to fraud prevention.

Place said he actually prefers the term data reservoir to data lake. In his view, a reservoir conveys the idea that ingested data will be worked on.

"A data lake itself is just a collection of raw data that you don't understand. It can be something you can't manage and you can't validate for your users," he said. "With a reservoir, that data becomes well governed, well understood and well managed. And, you can actually do more valuable things with the data."

Up from the sandbox

As a term, data lake is far from universally welcome. It's not a favorite of Luminita Vollmer, senior IT architect for data and business intelligence delivery at Thrivent Financial, an insurance and investment management company in Minneapolis. She told an EDW 2016 crowd she preferred the common development term sandbox, because much of the data lake's use is experimental.

Still, in a session on the prospects of data warehousing, she told participants to look at their present data warehouse with a view toward how their organizations will use tools of the future, including NoSQL databases and predictive analytics software. Hadoop, she said, has already found a place in the data architectures of many organizations.

Like others, Vollmer said that a new spectrum of data analytics users is emerging. Things are different from how they were when the enterprise data warehouse was the only game in town, she said, and that will affect the way data management teams are organized going forward.

"You have to have some people that support present systems and some people doing some research," Vollmer said. "That is a change in the way we do things."
