The essence of data governance is creating corporate data policies and ensuring that people comply with them. Such policies can span an array of intents, including directives on data protection, validation and usage. Data stewards and data governance managers must solicit data requirements from business users and work with data governance council members to agree on common data definitions, specify data quality metrics, articulate associated policies and develop approaches to measure compliance.
It's a big challenge, though, to bridge the gap between defining data governance policies and implementing them. The policies are intended to assert control and oversight over the quality of data assets across business workflows, yet data stewards are often tasked with critical data quality management responsibilities without being given the proper training or technology.
That's where data lineage tools come in. Metadata-based data lineage information documents the journey that data objects take through an organization's systems. Lineage records can help data analysts and other end users understand the data they work with, but they also simplify two critical data governance procedures: analyzing the root causes of data quality issues and assessing the impact of changes to data sets in source systems.
Data lineage and data governance
Without a way to determine where data errors are introduced into a data management environment, data stewards and data quality analysts will find it difficult to identify and fix them. That has consequences: If data flaws continue to propagate in systems, the organization may be plagued by inconsistent or inaccurate analytics and reporting that lead to bad decision-making in business operations.
In the root cause analysis process, data lineage tools provide visibility into the sequence of processing stages through which the data that's being checked flows. The quality of the data can be examined at each stage, enabling data governance and data quality teams to find the points where data errors originate.
Working backward from where an error is first identified, a data steward can insert controls at earlier points to check whether the data still conformed to the defined expectations at each point or already included the error. By pinpointing the processing stage at which the data was compliant upon entry but flawed upon exit, the data steward and other workers involved in a data governance program can focus on eliminating the root cause instead of just correcting the bad data.
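The backward walk described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `StageSnapshot` structure, the stage names and the `country_is_iso2` rule are all hypothetical, and a real lineage tool would supply the captured stage inputs and outputs through its own interfaces.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class StageSnapshot(NamedTuple):
    """Hypothetical record of one data item at the entry and exit of a stage."""
    stage: str
    data_in: Dict[str, object]
    data_out: Dict[str, object]


def find_root_cause(
    snapshots: List[StageSnapshot],
    conforms: Callable[[Dict[str, object]], bool],
) -> Optional[str]:
    """Walk backward from the last stage and return the first stage found
    where the data was compliant on entry but flawed on exit."""
    for snap in reversed(snapshots):
        if conforms(snap.data_in) and not conforms(snap.data_out):
            return snap.stage
    return None  # no stage in this list introduced the flaw


def country_is_iso2(record: Dict[str, object]) -> bool:
    """Assumed data quality rule: country must be a two-letter uppercase code."""
    value = record.get("country")
    return isinstance(value, str) and len(value) == 2 and value.isupper()


# Hypothetical snapshots, listed in processing order.
snapshots = [
    StageSnapshot("ingest", {"country": "US"}, {"country": "US"}),
    StageSnapshot("standardize", {"country": "US"}, {"country": "United States"}),
    StageSnapshot("load_warehouse", {"country": "United States"},
                  {"country": "United States"}),
]

root_cause_stage = find_root_cause(snapshots, country_is_iso2)
print(root_cause_stage)
```

In this made-up example, the error is first noticed in the warehouse, but the check shows that the data entered the "standardize" stage compliant and left it flawed, so that is the stage to fix.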
Data lineage tools can also help data governance teams do impact analysis to stay on top of issues caused by changes to source data formats and structures in data management environments, which often are much more dynamic now than they were in the past.
When source data changes, there may be unintended consequences downstream. By working forward from the point of data creation or collection, a data steward can rely on data lineage documentation to help trace data dependencies and determine the processing stages that are affected by the changes in the data. This enables the data governance and data management teams to reengineer the affected stages to accommodate the changes and ensure that data remains consistent in different systems.
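As a rough illustration of working forward through data dependencies, the sketch below does a breadth-first traversal of a lineage graph to list everything downstream of a changed source. The graph, the data set names and the `downstream_impact` function are assumptions made for this example, not the output or API of any particular product.

```python
from collections import deque
from typing import Dict, List


def downstream_impact(edges: Dict[str, List[str]], changed: str) -> List[str]:
    """Return every data set reachable downstream of `changed`,
    in breadth-first order, without visiting any node twice."""
    seen = {changed}
    affected: List[str] = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for target in edges.get(node, []):
            if target not in seen:
                seen.add(target)
                affected.append(target)
                queue.append(target)
    return affected


# Hypothetical lineage: each key feeds the data sets in its list.
lineage_edges = {
    "crm.customers": ["staging.customers"],
    "staging.customers": ["warehouse.dim_customer", "quality.customer_checks"],
    "warehouse.dim_customer": ["reports.churn_dashboard"],
    "erp.orders": ["warehouse.fact_orders"],
}

affected = downstream_impact(lineage_edges, "crm.customers")
print(affected)
```

Here a change to the hypothetical `crm.customers` source touches four downstream data sets, while `warehouse.fact_orders` is unaffected because it depends only on `erp.orders`.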
What to look for in data lineage tools
Manually collecting metadata and documenting data lineage requires a significant resource investment. It's also prone to error, which can cause big problems, especially in organizations that rely on data analytics to drive business operations. As a result, data governance efforts benefit from tools that manage the representations of your data's lineage and automatically map them across the enterprise.
During the technology evaluation process, you should look for data lineage tools that:
- Natively access a broad array of data sources and data products, survey the metadata they contain and collect it for data governance uses.
- Aggregate the captured metadata into a centralized repository.
- Infer data types and match common uses of reference data to data elements from different systems.
- Provide simplified presentations of the aggregated metadata to a variety of end users and support collaborative efforts to validate the metadata descriptions.
- Document the end-to-end mapping of how data flows through your organization's processing streams.
- Generate visualized representations of data lineage.
- Include APIs for developers to use in building applications that can query the lineage records.
- Create an inverted index to map data element names to their uses in different processing stages.
- Offer a search capability to rapidly trace the flow of data from its origination point to its downstream targets.
- Enable users to monitor data flows both forward and backward.
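To make the inverted index item in the list above more concrete, here is a minimal Python sketch that maps data element names to the processing stages that use them. The stage names, element names and `build_inverted_index` function are hypothetical; commercial tools build and expose such indexes in their own ways.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(stage_elements: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each data element name (lowercased) to the set of stages using it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for stage, elements in stage_elements.items():
        for element in elements:
            index[element.lower()].add(stage)
    return dict(index)


# Hypothetical metadata: the data elements each processing stage touches.
stage_elements = {
    "ingest_crm": ["customer_id", "email", "country"],
    "standardize_addresses": ["customer_id", "country"],
    "load_warehouse": ["customer_id", "email"],
}

element_index = build_inverted_index(stage_elements)

# Look up which stages use the "country" element.
stages_using_country = sorted(element_index.get("country", set()))
print(stages_using_country)
```

A lookup like this is what lets a search feature answer "where is this element used?" without scanning every stage definition each time.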
Data lineage technology options
There's a plethora of product options to consider. Tools for documenting and managing data lineage are part of the data management platforms sold by large IT vendors, including IBM, Informatica, Information Builders, Oracle, SAP and SAS Institute. They're also offered by smaller software vendors that focus on data integration, quality and governance, such as Adaptive, ASG Technologies, Collibra, Erwin, Infogix and Talend, and by metadata management specialists like Alex Solutions and Octopai.
In addition, data lineage capabilities are built into the data catalog software developed by companies that include Alation and Waterline Data. For data engineers and analytics teams, DataRobot's Paxata unit, Trifacta and other data preparation vendors also incorporate data lineage functions into their products, as do various vendors of BI and analytics tools.