Business Information

Data governance tools for Hadoop infiltrate the enterprise
Be it open source or commercial technology, software designed to ensure proper data governance in Hadoop data lakes is proliferating. But like many big data systems, these tools are still maturing.
Data governance is a big issue for mainstream big data users. Lately, IT vendors have touted both open source and commercial tools for governing data in Hadoop-based data lakes. But how do IT pros sift through all these products to automate governance, metadata management and other big data tasks -- and just how attainable, and useful, are the tools?
The buzz: Big data governance options grow
One open source approach comes from Hadoop platform vendor Hortonworks Inc., which created Apache Atlas as an open source data governance framework that provides a "scalable and extensible set of core foundational services."
According to the company, Apache Atlas not only allows enterprises to meet their Hadoop compliance requirements, but also integrates with the entire enterprise data ecosystem. Hortonworks touts the involvement of software developers from some large and high-profile companies in the Atlas open source project -- for example, Aetna, Merck, Schlumberger and Target.
"As you start to do enterprise applications driven by big data," Hortonworks CTO Scott Gnau explained, "security and governance and operational sophistication become much more important." The Atlas technology is "broad enough and capable enough that we're seeing a surge in customers for deployments," he claimed.
Commercially, Hortonworks rival Cloudera has released Cloudera Navigator, which it calls "the only complete data governance solution for Apache Hadoop." Part of Cloudera Enterprise, the company's big data platform, Navigator features data discovery, continuous optimization, audit, data lineage, metadata management and policy enforcement capabilities.
In addition to Atlas and Cloudera Navigator, other commercial and open source technologies that support data governance in big data environments include the following:
- Apache Falcon, an open source tool that centralizes data lifecycle management in Hadoop clusters.
- Data governance software from various vendors, including Collibra, Datameer, Informatica, SAS and Talend.
- Data lake management platforms with embedded governance functions from vendors such as Podium Data, Teradata and Zaloni.
Eric Anderson, executive director of data at Beachbody LLC, said the maker of fitness and nutrition products uses technologies such as Falcon and HCatalog -- a table and storage management layer tied to Apache Hive -- to document metadata and data lineage in its analytics systems. Beachbody started running a Hadoop-based data lake on the Amazon Web Services cloud in December 2016 to support new big data analytics applications.
The reality: Hadoop ecosystem immature
The data governance capabilities provided by the available technologies might be good enough for now, "but some of the tools aren't as mature as desired," Anderson said. "If we want end-to-end data lineage, it's hard to get there yet." His ultimate goal, he noted, is to attain "full lineage in terms of understanding data sources across all touch points."
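To make the idea of end-to-end lineage concrete, here is a minimal Python sketch that treats lineage as a directed graph and walks it backward from a report to its original sources. The dataset names and the `trace_sources` helper are hypothetical illustrations; they do not reflect Beachbody's systems or the APIs of Falcon, HCatalog or Atlas.

```python
from collections import defaultdict

# Hypothetical lineage edges: (upstream dataset, downstream dataset).
EDGES = [
    ("crm_export", "raw_customers"),
    ("web_logs", "raw_clicks"),
    ("raw_customers", "customer_profile"),
    ("raw_clicks", "customer_profile"),
    ("customer_profile", "sales_dashboard"),
]


def build_upstream_index(edges):
    """Map each dataset to the set of datasets it is derived from directly."""
    parents = defaultdict(set)
    for upstream, downstream in edges:
        parents[downstream].add(upstream)
    return parents


def trace_sources(dataset, parents):
    """Return every original source (a dataset with no parents) feeding `dataset`."""
    sources, seen, stack = set(), set(), [dataset]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        direct = parents.get(node, set())
        if not direct and node != dataset:
            sources.add(node)  # nothing upstream, so this is an origin
        stack.extend(direct)
    return sources


parents = build_upstream_index(EDGES)
print(sorted(trace_sources("sales_dashboard", parents)))
```

The walk only works if every hop between datasets has been recorded. A single undocumented touch point breaks the chain, which is one reason full lineage is hard to reach with tools that each cover only part of the pipeline.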
According to a 2016 Gartner survey, enterprises have shifted their focus from big data itself to specific business problems that big data can solve, which could affect deployments of data governance tools for Hadoop systems. While the share of surveyed companies that had invested in big data rose slightly from 2015 to 48% in 2016, the share planning to invest within the next two years fell from 31% to 25%.
Big data or no, data governance is a complicated process that involves much more than deploying software; for example, agreeing on common policies and procedures for using data is a key element of governance initiatives -- and often a big challenge. Combined with that, the lack of maturity in Hadoop data governance tools could prove detrimental to any momentum they're gaining in the enterprise. Even Gnau acknowledges that Atlas still has some maturing to do: "It isn't long in the tooth, for sure," he said.