Data lakes, with all kinds of unconventional new data types, are generating buzz these days. But turning a Hadoop data lake into something useful can mean importing a data type that is neither new nor out of the ordinary: relational data. After all, users want to combine traditionally structured data with newer unstructured data for analysis in the data lake, while also pulling in enterprise data from silos beyond their own department's boundaries.
Relational data is more familiar, but it is not necessarily a slam dunk for Hadoop data lake ingestion, as a major Texas school has found. The general message is that open source Hadoop components requiring special skills may not be the best way to fill a data lake with SQL data. More specifically, Apache Sqoop, a common open source Hadoop component that can pull relational data into a Hadoop data lake, may hit barriers in some organizations.
"We started out our data ingestion with Apache Sqoop, but it required that we do custom coding," said Juergen Stegmair, who is the lead for database administration at the University of North Texas (UNT).
He said that Sqoop's command-line-based programming required considerable coding in a style unfamiliar to the staff. Coding was not beyond them, but UNT's overall mission was to "avoid custom programming as much as possible," he said.
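The command-line coding Stegmair describes typically looks something like the following sketch of a Sqoop 1 import job. The connection string, credentials, table, and directory names here are hypothetical placeholders, not UNT's actual configuration:

```shell
# Hypothetical Sqoop import pulling one Oracle table into HDFS.
# Every value below is a placeholder for illustration only.
sqoop import \
  --connect jdbc:oracle:thin:@db.example.edu:1521/ORCL \
  --username etl_user \
  --password-file /user/etl/.oracle_pw \
  --table STUDENT_ENROLLMENT \
  --target-dir /data/lake/raw/student_enrollment \
  --num-mappers 4 \
  --as-parquetfile
```

Each source table generally needs its own such command, and ongoing change capture requires further flags such as `--incremental` and `--check-column`, often wrapped in scheduling scripts or workflows. That per-table scripting burden is the kind of custom programming UNT aimed to avoid.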
Shifting lakes of data
Shifting data to Hadoop data lakes is still a new experience for many teams, especially in IT shops within public universities, so adhering to a plan was important. That plan to achieve what Stegmair described as "a forward-looking architecture" began to germinate two years ago.
The architecture would incorporate open source Hadoop technology and add semistructured and unstructured data to the school's analytics portfolio. One pressing issue was the higher velocity of data intake.
UNT opted to build a Hadoop data lake using software from Hadoop distribution provider Hortonworks. Strategic planning was followed by a first-stage implementation, which UNT began in September 2016. This focused on integrating data from existing SQL Server and Oracle databases.
After some work with the Sqoop bulk transfer tool, the school turned to software provider Attunity for data integration software to handle the migration of data to the data lake, according to Stegmair. The commercial software proved a good fit for the data lake work.
"There are things that are just not there in open source infrastructure, and you have to implement it by yourself," Stegmair said. "Commercial tools still provide a useful level of management and enterprise adoption capabilities."
He said Attunity's Replicate tools were chosen to accomplish the SQL data transfer, and that a next stage of work on social media data ingestion is underway. The tools provided UNT data developers with a more visually oriented interface than Sqoop for creating workflows to handle data movement.
Large-scale, distributed Hadoop data lakes are a promising addition to enterprise software arsenals, according to Kevin Petrie, product manager at Attunity. But those data lakes call for changes in organizations' approaches to data transfer and data management. It can be important to employ tools that don't increase staff requirements, he said.
"Before, a data administrator might watch a single server; now, they may watch hundreds of servers with hundreds of endpoints," he said. "In a sense, the Hadoop data lake is a big step forward, but companies can't afford to have 12 administrators overseeing it."
Beyond the first load
Tools like Attunity Replicate, with configurable features that can be applied differently across source and target formats, have value that goes beyond the first load for the acquisition and sourcing pieces, according to Merv Adrian, an analyst at Gartner. The ability to handle structured SQL data is important in this, he added.
Adrian said Gartner research positions structured transactional data as the major data source for big data analysis. In covering structured as well as unstructured data input, Attunity is not alone, he noted. These and other commercial tools can have some benefits over open source tools, however.
"There are many traditional tools that work there. The new ones in the Apache stack are not nearly as mature or full-featured," he said. "They are also less comprehensive in the number of sources and target formats supported."
Initial stocking of a data lake, incrementally adding to it, and categorizing the data for recall are all part of the process of moving Hadoop-style big data development from prototype to production. And further effort will be required.
The data lake concept has grown from a "simple-minded dumping ground" into a core supply component of the data fabric, Adrian said, but there is still much work to be done.