Hadoop data lake architecture tests IT on data integration

Hortonworks users talk about building Hadoop data lakes to support new applications -- and the challenges their teams face in ingesting and refining data for end users.

SAN JOSE, Calif. -- These days, a rolling pageant of glittering new data objects includes data planes, data fabrics, data streaming and more. It's almost enough to make you forget about the shiny object of just a few years ago: the Hadoop data lake architecture.

But plenty of data teams are still working hard to successfully implement real-life Hadoop data lakes -- ones that underlie many organizations' hopes for better predictive analytics and maybe even artificial intelligence.

Such was one of the lessons from Hortonworks' DataWorks Summit 2018, where people like Sudhir Menon told the story behind big companies' moves to use their data hoards for digital transformation, as seen at tech-infused upstarts like Airbnb and Uber.

A Hadoop data lake architecture is an integral part of the journey described by Menon, vice president of enterprise information management at hotelier Hilton Worldwide.

"We have a lot of information in different formats, and we are bringing that into the data lake. Every [data] entity from every channel is coming into the lake now," Menon told a conference session audience.

The Hadoop data lake architecture has formed a basis for a potential consumer application -- a new digital key app that allows Hilton Honors program guests to, in effect, check themselves directly into their room, he said.

Populating the data lake takes time

Still, this will be a multiyear project, Menon noted. The project, which seeks to incrementally build out the Hadoop-based Hortonworks Data Platform (HDP) into a new repository for enterprise data, began about two years ago and is now moving toward going live, with many more "agile sprints" ahead, according to Menon.

The system employs a variety of tools beyond HDP, including WSO2 API management, Talend integration and Amazon Redshift cloud data warehouse software.

For Menon's team, populating a data lake means transforming assorted ingested data into JSON events with a microservices architecture.
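Menon didn't detail Hilton's implementation, but the general pattern is simple enough to sketch: each microservice takes whatever its source channel produces and wraps it in a normalized JSON event before it lands in the lake. The snippet below is a minimal illustration in Python; the envelope fields and channel names are hypothetical, not Hilton's schema.

```python
import json
import uuid
from datetime import datetime, timezone

def to_json_event(raw_record: dict, channel: str) -> str:
    """Wrap a raw source record in a normalized JSON event envelope.

    The envelope fields (event_id, channel, ingested_at, payload) are
    illustrative; a real schema would be defined by the data team.
    """
    event = {
        "event_id": str(uuid.uuid4()),
        "channel": channel,  # e.g., "web", "mobile_app", "property_system"
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": raw_record,  # the original record is preserved as-is
    }
    return json.dumps(event)

# Example: a reservation-like record arriving from a mobile channel
print(to_json_event({"confirmation": "ABC123", "property": "XYZ"}, channel="mobile_app"))
```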

The transformations are a first step in a data-refining process. The experience of many data lake users has shown that data has to be sorted into some sensible format as soon as possible if business analysts are to use it with BI tools, although raw versions of the data are still preserved for data scientists to experiment with.
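That two-step approach -- keep the raw events, but land a refined copy quickly for BI -- is often expressed as separate raw and curated zones in the lake. A hedged PySpark sketch, with made-up paths and column names, shows the idea; it is not Hilton's pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("refine-events").getOrCreate()

# Raw zone: JSON events are kept untouched for data scientists to explore.
raw = spark.read.json("hdfs:///lake/raw/events/")  # hypothetical path

# Curated zone: a flattened, typed view that BI tools can query directly.
curated = raw.select(
    "event_id",
    "channel",
    F.to_timestamp("ingested_at").alias("ingested_at"),
    F.col("payload.confirmation").alias("confirmation"),  # illustrative fields
    F.col("payload.property").alias("property"),
)

curated.write.mode("append").partitionBy("channel").parquet("hdfs:///lake/curated/events/")
```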

Menon emphasized that, while this "renovation and innovation" project supports data science, it also provides a new foundation for leaner everyday data reporting. He said that, during the course of the project, Hilton has decommissioned 380 management reporting dashboards and replaced them with a more compact roster of 40.

What price, data democracy?

For companies like Hilton, with years of legacy data, Hadoop data lake architectures can require significant effort.

Another data lake project is underway at United Airlines. At DataWorks, Joe Olson, senior manager for big data analytics at the Chicago-based airline, recounted a move to a new big data analytics environment that includes a data lake, along with a "curated layer of data."

Olson described the significant work required to connect existing Teradata data warehouse analytics with pieces of Hortonworks' platform.
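Olson didn't spell out the mechanics, but one common way to bridge a Teradata warehouse and a Hadoop-based platform is a JDBC pull into the lake. The PySpark sketch below is an assumption-laden illustration: the connection URL, credentials, table name and target path are placeholders, and it presumes the Teradata JDBC driver is available on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("teradata-to-lake").getOrCreate()

# Pull a warehouse table over JDBC; all connection details are placeholders.
warehouse_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:teradata://warehouse-host/DATABASE=analytics")
    .option("driver", "com.teradata.jdbc.TeraDriver")
    .option("dbtable", "flight_bookings")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# Land the result in the lake's curated layer as Parquet.
warehouse_df.write.mode("overwrite").parquet("hdfs:///lake/curated/flight_bookings/")
```

A single-connection pull like this works for modest tables; very large data sets typically call for partitioned JDBC reads or bulk export tooling instead.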

"Moving small data sets is trivial ... but we don't yet have a good way to handle large data sets of certain types," Olson said in a session on big data at United.

In yet another session, which was whimsically entitled "The Curious Case of Data Lake Redemption," Shivinder Singh, a database architect and engineer at Verizon Wireless, based in Basking Ridge, N.J., described issues the telecom giant encountered as it opened up a Hadoop data lake to a wider group of analytics users.

Singh said differences between the block file sizes in Hadoop data lakes and those in single-client implementations can bring about garbage collection issues with a noticeable performance impact. He said his teams worked with Hortonworks to address such issues in its data lake buildouts.
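Singh didn't go into specifics, but one common way such mismatches show up is as a small-files problem: many files far below the configured HDFS block size inflate NameNode metadata and JVM garbage collection pressure. The purely illustrative Python check below uses a hard-coded file listing to show the kind of audit involved; in practice the listing would come from the HDFS client or an `hdfs dfs -ls -R` dump.

```python
# Illustrative audit: flag files that are much smaller than the target
# HDFS block size. The listing is hard-coded for the example.

TARGET_BLOCK_SIZE = 128 * 1024 * 1024     # 128 MB, a common HDFS default
SMALL_FILE_THRESHOLD = TARGET_BLOCK_SIZE // 4

listing = [
    ("/lake/raw/events/part-0001.json", 3_200_000),
    ("/lake/raw/events/part-0002.json", 2_900_000),
    ("/lake/curated/events/part-0001.parquet", 260_000_000),
]

small_files = [(path, size) for path, size in listing if size < SMALL_FILE_THRESHOLD]

print(f"{len(small_files)} of {len(listing)} files are well below the block size")
for path, size in small_files:
    print(f"  {path}: {size / 1_000_000:.1f} MB")
```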

Despite such implementation issues, he emphasized, the Hadoop platform has helped drive diverse analytics advances as Verizon Wireless evolved its architecture to handle larger and ever more varied volumes of data.

Sampling the data fabric

While the Hadoop data lake architecture was meant, in part, to reduce data silos in organizations, the reality has been that several data lakes may arise, becoming silos in themselves. At its user event, Hortonworks expanded on its recent discussions of data fabric architectures, meant to mesh varied data lakes and other data framework components.

The company's Hortonworks DataPlane Service (DPS) is one example: a software layer that handles governance and data management for multiple data lakes.

Like other products from vendors that once focused solely on Hadoop Distributed File System data storage, DPS supports a variety of storage formats, including cloud object storage.
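One practical upshot of that flexibility is that processing code can stay the same while the storage layer moves between HDFS and object stores. A hedged PySpark sketch, with placeholder bucket and path names, reads the same curated data set from either location just by swapping the URI scheme; the s3a path assumes the Hadoop S3A connector and credentials are configured on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-agnostic-read").getOrCreate()

# The same read works against HDFS or cloud object storage; only the URI
# scheme changes (paths and bucket names here are placeholders).
on_prem = spark.read.parquet("hdfs:///lake/curated/events/")
in_cloud = spark.read.parquet("s3a://example-lake-bucket/curated/events/")

print(on_prem.count(), in_cloud.count())
```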

"The data fabric is an idea that has been around and has evolved to encompass the reality that, going forward, these systems will be hybrids of on-premises and public cloud systems and, eventually, will be on multiple clouds," said Doug Henschen, an analyst at Constellation Research, in an interview

That could mean that, while the data lake may continue to find use, it may not actually be a Hadoop data lake architecture, Henschen said.

"Now, companies want the data lake concept to encompass more than just Hadoop," he said.
