Data governance tools for Hadoop infiltrate the enterprise

Be it open source or commercial technology, software designed to ensure proper data governance in Hadoop data lakes is proliferating. But like many big data systems, these tools are still maturing.

This article can also be found in the Premium Editorial Download: Business Information: Data governance programs loom larger than ever in the big data era.

Data governance is a big issue for mainstream big data users. And lately, IT vendors have touted both open source and commercial tools for governing data in Hadoop-based data lakes. But how do IT pros sift through all these products to automate governance, metadata management and other big data tasks, and just how attainable -- and useful -- are the tools?

The buzz: Big data governance options grow

One open source approach comes from Hadoop platform vendor Hortonworks Inc., which created Apache Atlas as an open source data governance framework that provides a "scalable and extensible set of core foundational services."

According to the company, Apache Atlas not only allows enterprises to meet their Hadoop compliance requirements, but also integrates with the entire enterprise data ecosystem. Hortonworks touts the involvement of software developers from some large and high-profile companies in the Atlas open source project -- for example, Aetna, Merck, Schlumberger and Target.

"As you start to do enterprise applications driven by big data," Hortonworks CTO Scott Gnau explained, "security and governance and operational sophistication become much more important." The Atlas technology is "broad enough and capable enough that we're seeing a surge in customers for deployments," he claimed.

Commercially, Hortonworks rival Cloudera has released Cloudera Navigator, which it calls "the only complete data governance solution for Apache Hadoop." Part of Cloudera Enterprise, the company's big data platform, Navigator features data discovery, continuous optimization, audit, data lineage, metadata management and policy enforcement capabilities.

In addition to Atlas and Cloudera Navigator, other commercial and open source technologies that support data governance in big data environments include the following:

  • Apache Falcon, an open source tool that centralizes data lifecycle management in Hadoop clusters.
  • Data governance software from various vendors, including Collibra, Datameer, Informatica, SAS and Talend.
  • Data lake management platforms with embedded governance functions from vendors such as Podium Data, Teradata and Zaloni.

Eric Anderson, executive director of data at Beachbody LLC, said the maker of fitness and nutrition products uses technologies such as Falcon and HCatalog -- a table and storage management layer tied to Apache Hive -- to document metadata and data lineage in its analytics systems. Beachbody started running a Hadoop-based data lake on the Amazon Web Services cloud in December 2016 to support new big data analytics applications.

The reality: Hadoop ecosystem immature

The data governance capabilities provided by the available technologies might be good enough for now, "but some of the tools aren't as mature as desired," Anderson said. "If we want end-to-end data lineage, it's hard to get there yet." His ultimate goal, he noted, is to attain "full lineage in terms of understanding data sources across all touch points."

According to a 2016 Gartner survey, enterprises have shifted their focus from big data itself to specific business problems that big data can solve, which could affect deployments of data governance tools for Hadoop systems. While 48% of companies surveyed slightly increased their spending on big data from 2015 to 2016, the share of companies that plan to invest within the next two years fell from 31% to 25%.

Big data or no, data governance is a complicated process that involves much more than deploying software; for example, agreeing on common policies and procedures for using data is a key element of governance initiatives -- and often a big challenge. Combined with that, the lack of maturity in Hadoop data governance tools could prove detrimental to any momentum they're gaining in the enterprise. Even Gnau acknowledges that Atlas still has some maturing to do: "It isn't long in the tooth, for sure," he said.

This was last published in June 2017

Join the conversation

What type of data governance tools are in your enterprise?

Besides Atlas and Navigator, are there any recommendations? HCatalog and Falcon can serve, but customization is needed.

Many companies claim to offer data governance for Hadoop; how ready the tools are is the question. Informatica and Collibra are more examples.

In one project I worked on, MM was recommended for Talend Studio. I do not know how far it can stretch.

We use IVM from DATUM and are having good success.
