Hadoop tool finds low-hanging fruit for migrating data warehouse jobs

It's still hard to move existing data warehouse jobs to Hadoop, but emerging tools can help. Cloudera Navigator Optimizer is one such Hadoop tool tapped by digital marketing company Conversant.

Offloading work from a high-cost data warehouse is often seen as a first target for commodity Hadoop clusters. Moving over extract, transform and load (ETL), querying and reporting jobs doesn't dramatically change the way business is done, but it can curb data warehouse growth and costs.

But, even several years into the Hadoop era, migrating jobs to the distributed platform isn't necessarily easy. Being able to figure out which jobs can move over without troublesome amounts of developer effort can help data managers concentrate their first efforts on the best pickings.

A Hadoop tool from Cloudera Inc. called Navigator Optimizer looks to help with that. It grew out of the company's 2015 purchase of Xplain.io, a company that sought to bring to Hadoop some database optimization capabilities akin to those long familiar to SQL users. The product went into general availability this summer.

"The tool allows people to start looking at the queries that are running on other platforms and see how they would act in our Hadoop environment," said Peter Wojciechowski, software engineering manager at Conversant LLC, a Chicago-based digital marketing company that churns through large amounts of data to present personalized ads to web users.

Conversant in queries

Conversant originally employed Hadoop as a first landing place for data, with processing then performed by a Pivotal Greenplum data warehouse for analytics. Using Navigator Optimizer, teams have been able to move some jobs to Hadoop and its Apache Hive data warehousing and Impala SQL query environments.

"Now, the core ETL, as well as some large processing tasks, happens in a Hadoop cluster," Wojciechowski said. He added that highly iterative processing tasks are good targets for Hadoop. Wojciechowski emphasized that Greenplum hasn't been displaced. It's still responsible for important analytics at Conversant. But now, its use is more refined.

"Before, Greenplum did all the workloads, but not all of them were well-suited for it," Wojciechowski said. "Now, with this tool, we are able to look and see, for example, what is good to run in Hive."

By using the Optimizer software, Wojciechowski and his team can tell how well queries will perform in Hive or Impala, as well as receive guidance on how the queries execute in the new environment. The Hadoop tool has further use in production. Optimizer, working together with Navigator, "tells you how to group workloads of queries so you can find more duplication and make more efficient use of the cluster," he said.

Technology like Navigator Optimizer helps uncover data joins, a common SQL trait that can stymie Hadoop, according to James Curtis, an analyst at 451 Research. "Navigator analyzes existing jobs and, for example, estimates the number of joins that should be redone before moving jobs to Impala or Hive," he said.
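To give a rough sense of what ranking jobs by join count might look like, here is a minimal Python sketch. It is not how Navigator Optimizer works internally, which the article does not describe; the function names, the sample job names and the idea of using a raw JOIN keyword count as a proxy for migration effort are all assumptions made for illustration. A real tool would parse the SQL rather than pattern-match it.

```python
import re

# Hypothetical heuristic: treat the number of JOIN keywords as a rough
# proxy for how much rework a query may need before it moves to Hive or Impala.
# A plain regex also counts "join" inside string literals or comments,
# so this is only a first-pass estimate.
JOIN_PATTERN = re.compile(r"\bjoin\b", re.IGNORECASE)


def count_joins(sql: str) -> int:
    """Return the number of JOIN keywords found in a SQL string."""
    return len(JOIN_PATTERN.findall(sql))


def rank_by_join_count(queries: dict[str, str]) -> list[tuple[str, int]]:
    """Order job names from fewest joins to most, breaking ties by name."""
    counts = [(name, count_joins(sql)) for name, sql in queries.items()]
    return sorted(counts, key=lambda item: (item[1], item[0]))


# Made-up sample jobs, purely for illustration.
sample_jobs = {
    "daily_clicks": "SELECT ad_id, COUNT(*) FROM clicks GROUP BY ad_id",
    "user_report": (
        "SELECT * FROM users u JOIN visits v ON u.id = v.user_id "
        "LEFT JOIN ads a ON v.ad_id = a.id"
    ),
    "ad_spend": "SELECT * FROM ads a INNER JOIN spend s ON a.id = s.ad_id",
}

ranking = rank_by_join_count(sample_jobs)
for job_name, join_count in ranking:
    print(f"{job_name}: {join_count} join(s)")
```

Jobs at the top of such a list, with few or no joins, would be the "low-hanging fruit" candidates to examine first.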

He agrees that the tool has a role in job migration, but emphasizes that optimizing queries has far broader use than just for migration. "For companies that have thousands of queries, optimizing queries isn't trivial work," Curtis said.

Shift and lift

The usefulness of products like the Cloudera Navigator suite can extend to one of the most difficult migration jobs of all: moving mainframe data into the Hadoop ecosystem.

To that end, mainframe and Hadoop data transformation specialist Syncsort Inc. said last month that it was working with Cloudera to improve data governance by connecting Navigator to its tools for tracking data lineage from legacy sources. Such legacy sources are not limited to the mainframe but include data warehouses running on midrange systems.

Cloudera isn't alone in providing tools to ease the move of relational data warehouse jobs to the world of Hadoop. The area is active.

For their parts, independent Hadoop distribution competitors Hortonworks Inc. and MapR Technologies Inc. offer related Hadoop tools, including SQL optimization tools based on Apache Calcite, an open source project that includes a SQL parser and query planner and that just celebrated its first birthday.

What's more, data management services company Bitwise recently introduced Hydrograph, a tool meant to streamline the offloading of ETL workloads to Hadoop and other big data frameworks. Developed along with customer Capital One, Bitwise's software is based on a development environment that employs XML interfaces, so that jobs can be moved to different Hadoop frameworks -- for example, from MapReduce to Tez -- with minimal reconfiguration.

If Hadoop tools like these can move migration designs beyond whiteboards and trial and error, uptake of Hadoop could improve. For Hadoop and its ecosystem components in the enterprise, efficiently getting data warehouse work onto the platform remains an important first step.
