Published: 07 Jun 2017
Data governance is a big issue for mainstream big data users. Lately, IT vendors have touted both open source and commercial tools for governing data in Hadoop-based data lakes. But how do IT pros sift through all these products to automate governance, metadata management and other big data tasks, and just how attainable -- and useful -- are they?
The buzz: Big data governance options grow
One open source approach comes from Hadoop platform vendor Hortonworks Inc., which created Apache Atlas as an open source data governance framework that provides a "scalable and extensible set of core foundational services."
According to the company, Apache Atlas not only allows enterprises to meet their Hadoop compliance requirements, but also integrates with the entire enterprise data ecosystem. Hortonworks touts the involvement of software developers from some large and high-profile companies in the Atlas open source project -- for example, Aetna, Merck, Schlumberger and Target.
"As you start to do enterprise applications driven by big data," Hortonworks CTO Scott Gnau explained, "security and governance and operational sophistication become much more important." The Atlas technology is "broad enough and capable enough that we're seeing a surge in customers for deployments," he claimed.
Commercially, Hortonworks rival Cloudera has released Cloudera Navigator, which it calls "the only complete data governance solution for Apache Hadoop." Part of Cloudera Enterprise, the company's big data platform, Navigator features data discovery, continuous optimization, audit, data lineage, metadata management and policy enforcement capabilities.
In addition to Atlas and Cloudera Navigator, other commercial and open source technologies that support data governance in big data environments include the following:
- Apache Falcon, an open source tool that centralizes data lifecycle management in Hadoop clusters.
- Data governance software from various vendors, including Collibra, Datameer, Informatica, SAS and Talend.
- Data lake management platforms with embedded governance functions from vendors such as Podium Data, Teradata and Zaloni.
Eric Anderson, executive director of data at Beachbody LLC, said the maker of fitness and nutrition products uses technologies such as Falcon and HCatalog -- a table and storage management layer tied to Apache Hive -- to document metadata and data lineage in its analytics systems. Beachbody started running a Hadoop-based data lake on the Amazon Web Services cloud in December 2016 to support new big data analytics applications.
The reality: Hadoop ecosystem immature
The data governance capabilities provided by the available technologies might be good enough for now, "but some of the tools aren't as mature as desired," Anderson said. "If we want end-to-end data lineage, it's hard to get there yet." His ultimate goal, he noted, is to attain "full lineage in terms of understanding data sources across all touch points."
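To make the idea of end-to-end lineage concrete, here is a minimal Python sketch of what tracing a data set back to its sources might involve. It is not how Falcon, HCatalog or Atlas implement lineage; the `LineageGraph` class and the data set names (`crm_export`, `web_logs` and so on) are hypothetical and exist only for illustration.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage store: remembers which data sets feed which (hypothetical)."""

    def __init__(self):
        # Maps each data set to the set of data sets it was derived from.
        self._upstream = defaultdict(set)

    def record(self, source, target):
        """Record that `target` was produced, at least in part, from `source`."""
        self._upstream[target].add(source)

    def sources_of(self, dataset):
        """Return every data set upstream of `dataset`, direct or indirect."""
        seen = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for parent in self._upstream.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def original_sources(self, dataset):
        """Return only the upstream data sets that have no recorded parents."""
        return {d for d in self.sources_of(dataset) if not self._upstream.get(d)}


# Illustrative pipeline; every name below is invented for this example.
lineage = LineageGraph()
lineage.record("crm_export", "raw_customers")
lineage.record("raw_customers", "clean_customers")
lineage.record("web_logs", "raw_clicks")
lineage.record("raw_clicks", "customer_activity")
lineage.record("clean_customers", "customer_activity")

print(sorted(lineage.original_sources("customer_activity")))
```

The traversal only works if every hop has been recorded. A single unrecorded touch point breaks the chain, which is one way to read Anderson's point about the difficulty of getting full lineage.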
According to a 2016 Gartner survey, enterprises have shifted their focus from big data itself to specific business problems that big data can solve, which could affect deployments of data governance tools for Hadoop systems. While 48% of companies surveyed slightly increased their spending on big data from 2015 to 2016, the share that plan to invest within the next two years fell from 31% to 25%.
Big data or not, data governance is a complicated process that involves much more than deploying software; for example, agreeing on common policies and procedures for using data is a key element of governance initiatives -- and often a big challenge. Add to that the lack of maturity in Hadoop data governance tools, and any momentum they're gaining in the enterprise could stall. Even Gnau acknowledges that Atlas still has some maturing to do: "It isn't long in the tooth, for sure," he said.