Data governance is a big issue for mainstream big data users. Lately, IT vendors have touted open source as well as commercial data governance tools for Hadoop-based data lakes. But how do IT pros sift through all these products to automate governance, metadata management and other big data tasks, and just how attainable are those goals?
One open source approach comes from Hadoop vendor Hortonworks Inc., which created Apache Atlas as an open source framework that's a "scalable and extensible set of core foundational services." According to the company, Apache Atlas not only allows enterprises to meet their Hadoop compliance requirements, but also integrates with the entire enterprise data ecosystem. Hortonworks touts the involvement of developers from some large and high-profile companies in the open source project -- for example, Aetna, Merck, Schlumberger and Target.
"As you start to do enterprise applications driven by big data," Hortonworks CTO Scott Gnau explained, "security and governance and operational sophistication become much more important." The Atlas technology is "broad enough and capable enough that we're seeing a surge in customers for deployments."
Commercially, Hortonworks rival Cloudera has released Navigator, which the company calls "the only complete data governance solution for Apache Hadoop." Navigator features data discovery, continuous optimization, audit, lineage, metadata management and policy enforcement, and it is part of Cloudera Enterprise.
Eric Anderson, executive director of data at Beachbody LLC, said the maker of fitness and nutrition products uses technologies such as HCatalog and Apache Falcon to document metadata and data lineage in its analytics systems. Beachbody started running a Hadoop-based data lake on the Amazon Web Services cloud in December 2016.
The capabilities provided by these technologies might be good enough for now, "but some of the tools aren't as mature as desired," Anderson said. "If we want end-to-end data lineage, it's hard to get there yet." His ultimate goal is to attain "full lineage in terms of understanding data sources across all touch points."
According to a 2016 Gartner survey, enterprises have shifted their focus from big data itself to specific business problems that big data can solve, which could affect deployments of data governance tools. While 48% of companies surveyed slightly increased their spending on big data from 2015 to 2016, the share that plan to invest within the next two years fell from 31% to 25%.
Lack of maturity in data governance tools could prove detrimental to any momentum they've gained in the enterprise. Even Gnau acknowledges that Atlas still has some maturing to do: "It isn't long in the tooth, for sure."