
High-level tools help Hadoop data lake go to school

Moving relational data into Hadoop isn't a slam dunk. To avert programming complexities, a major Texas university turned to commercial tools to stock its Hadoop data lake.

Data lakes, with all kinds of unconventional new data types, are generating buzz these days. But turning a Hadoop data lake into something useful can mean importing a data type that is neither new nor out of the ordinary: relational data.

After all, users want to combine traditionally structured data with newer unstructured data for analysis in the data lake, while also gathering together enterprise data from silos beyond their own department's boundaries.

Relational data is more familiar, but it isn't necessarily a slam dunk for Hadoop data lake ingestion, as a large Texas school has found. The general message there is that open source Hadoop components that require special skills may not be the best way to fill a data lake with SQL data. More specifically, Apache Sqoop, a common open source Hadoop component for pulling relational data into a Hadoop data lake, may hit barriers in some organizations.

"We started out our data ingestion with Apache Sqoop, but it required that we do custom coding," said Juergen Stegmair, who leads the database administration team at the University of North Texas (UNT).

Stegmair said that Sqoop's command-line-based programming required considerable coding in a style unfamiliar to his staff. Coding wasn't beyond them, but UNT's overall aim was to "avoid custom programming as much as possible," he said.
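To give a sense of the command-line scripting involved, here is a minimal sketch in Python that assembles one `sqoop import` command per source table. It is an illustration only, not UNT's code: the host names, table names, user name, paths and the helper functions `build_sqoop_import` and `build_all` are all hypothetical. The flags shown (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--split-by`, `--num-mappers`) are standard Sqoop import options.

```python
import shlex


def build_sqoop_import(jdbc_url, table, target_dir, username,
                       password_file, split_by=None, num_mappers=4):
    """Return the argument list for one `sqoop import` run of a single table."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,        # HDFS directory that receives the rows
        "--num-mappers", str(num_mappers), # number of parallel map tasks
    ]
    if split_by:
        # Column Sqoop uses to divide the table among the mappers.
        args += ["--split-by", split_by]
    return args


# Hypothetical source tables; every name below is made up for illustration.
TABLES = [
    {"jdbc_url": "jdbc:sqlserver://dbhost.example.edu:1433;databaseName=students",
     "table": "enrollment", "split_by": "enrollment_id"},
    {"jdbc_url": "jdbc:oracle:thin:@dbhost.example.edu:1521/FINANCE",
     "table": "LEDGER"},
]


def build_all(tables, base_dir="/data/lake/raw"):
    """Build one import command per table, each landing in its own directory."""
    commands = []
    for t in tables:
        commands.append(build_sqoop_import(
            jdbc_url=t["jdbc_url"],
            table=t["table"],
            target_dir=f"{base_dir}/{t['table'].lower()}",
            username="etl_user",                    # assumed service account
            password_file="/user/etl/.db_password", # assumed HDFS path
            split_by=t.get("split_by"),
        ))
    return commands


if __name__ == "__main__":
    # Print the commands rather than run them; no Sqoop install is needed.
    for cmd in build_all(TABLES):
        print(shlex.join(cmd))
```

Even in this small form, each table needs its own connection string, target directory and split column, and scheduling, retries and incremental loads would all have to be scripted on top. That is the sort of hand-built plumbing a team trying to avoid custom programming would want to skip.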

Shifting lakes of data

Shifting data to Hadoop data lakes is still a new experience for many teams, especially in IT shops within public universities, so adhering to a plan was important. That plan to achieve what Stegmair described as "a forward-looking architecture" began to germinate two years ago.

The architecture would incorporate open source Hadoop technology and add semistructured and unstructured data to the school's analytics portfolio. One pressing issue was the higher velocity of data intake.

UNT, which is located in the city of Denton, opted to build a Hadoop data lake using software from Hadoop distribution provider Hortonworks. Strategic planning was followed by a first-stage implementation, which UNT began in September 2016. This focused on integrating data from existing SQL Server and Oracle databases.

Commercial alternative

After some work with the Sqoop bulk transfer tool, the school turned to software provider Attunity for data integration software to handle the migration of data to the data lake. The commercial software proved a good fit for the data lake work, according to Stegmair.

"There are things that are just not there in open source infrastructure, and you have to implement it by yourself," he said. "Commercial tools still provide a useful level of management and enterprise adoption capabilities."

Stegmair said that Attunity's Replicate tools were chosen to accomplish the SQL data transfer, and that a next stage of work on social media data ingestion is underway. The tools provided UNT data developers with a more visually oriented interface than Sqoop for creating workflows to handle data movement, he said.

Large-scale, distributed Hadoop data lakes are a promising addition to enterprise software arsenals, according to Kevin Petrie, product manager at Attunity. But those data lakes call for changes in organizations' approaches to data transfer and data management. It can be important to employ tools that don't increase staff requirements, he said.

"Before, a data administrator might watch a single server; now, they may watch hundreds of servers with hundreds of endpoints," he said. "In a sense, the Hadoop data lake is a big step forward, but companies can't afford to have 12 administrators overseeing it."

Beyond the first load

Tools like Attunity Replicate, whose features can be configured differently for different source and target formats, offer value in data acquisition and sourcing that goes beyond the first load, according to Merv Adrian, an analyst at Gartner. The ability to handle structured SQL data is an important part of this, he added.

Adrian said Gartner research positions structured transactional data as the major data source for big data analytics. Attunity isn't alone in covering structured as well as unstructured data input, he noted. And he thinks the commercial tools can have some benefits over open source alternatives.

"There are many traditional tools that work there. The new ones in the Apache stack are not nearly as mature or full-featured," Adrian said. "They're also less comprehensive in the number of sources and target formats supported."

Initial stocking of a data lake, incrementally adding to it, categorizing the data for recall -- these are all part of the process involved with moving Hadoop-style big data development from prototype to production. And further effort will be required.

As the data lake concept has grown from a "simple-minded dumping ground" to being seen as a core supply component of the data fabric, Adrian said, there is still much work to be done.
