Data governance tools for Hadoop infiltrate the enterprise

Be it open source or commercially available, software that ensures proper data governance in Hadoop-based data lakes is everywhere -- so how do you get the most bang for your buck?

This article can also be found in the Premium Editorial Download: Business Information: Data governance programs loom larger than ever in the big data era.

Data governance is a big issue for mainstream big data users. And lately, IT vendors have touted both open source and commercial tools for governing data in Hadoop-based data lakes. But how do IT pros sift through all these products to automate governance, metadata management and other big data tasks, and just how attainable are those goals?

The buzz

One open source approach comes from Hadoop vendor Hortonworks Inc., which created Apache Atlas as an open source framework that's a "scalable and extensible set of core foundational services." According to the company, Apache Atlas not only allows enterprises to meet their Hadoop compliance requirements, but also integrates with the entire enterprise data ecosystem. Hortonworks touts the involvement of developers from some large and high-profile companies in the open source project -- for example, Aetna, Merck, Schlumberger and Target.

"As you start to do enterprise applications driven by big data," Hortonworks CTO Scott Gnau explained, "security and governance and operational sophistication become much more important." The Atlas technology is "broad enough and capable enough that we're seeing a surge in customers for deployments."

On the commercial side, Hortonworks rival Cloudera has released Navigator, which the company calls "the only complete data governance solution for Apache Hadoop." Navigator features data discovery, continuous optimization, audit, lineage, metadata management and policy enforcement, and it is part of Cloudera Enterprise.

Eric Anderson, executive director of data at Beachbody LLC, said the maker of fitness and nutrition products uses technologies such as HCatalog and Apache Falcon to document metadata and data lineage in its analytics systems. Beachbody started running a Hadoop-based data lake on the Amazon Web Services cloud in December 2016.

The reality

The capabilities provided by these technologies might be good enough for now, "but some of the tools aren't as mature as desired," Anderson said. "If we want end-to-end data lineage, it's hard to get there yet." His ultimate goal is to attain "full lineage in terms of understanding data sources across all touch points."

According to a 2016 Gartner survey, enterprises have shifted their focus from big data itself to specific business problems that big data can solve, which could affect deployments of data governance tools. While 48% of companies surveyed slightly increased their spending on big data from 2015 to 2016, the share that plans to invest within the next two years fell from 31% to 25%.

Lack of maturity in data governance tools could prove detrimental to any momentum they've gained in the enterprise. Even Gnau acknowledges that Atlas still has some maturing to do: "It isn't long in the tooth, for sure."

This was last published in June 2017

What type of data governance tools are in your enterprise?
Besides Atlas and Navigator, are there any other recommendations? HCatalog and Falcon can serve, but customization is needed.
Many companies claim to offer data governance for Hadoop; how ready the tools are is the question. Informatica and Collibra are further examples.
In one project I worked on, MM was recommended for Talend Studio. I do not know how far it can stretch.