The essence of data governance is corporate data policy compliance. Data policies can span an array of intents, including directives about data protection and data validation. Data stewards and data governance council members must solicit requirements from business consumers, as well as articulate data quality requirements, specify data quality metrics and develop approaches to measure data policy compliance.
The challenge, though, is to bridge the gap between defining data governance policies and implementing them. Policies are intended to assert control and oversight over the quality of data assets across production flows, yet data stewards are often tasked with critical data quality management responsibilities without the proper training or tools.
This is where data lineage tools come in. Data lineage documents the journey that data objects take through the enterprise, and it helps to simplify two critical data governance procedures: root cause analysis and impact analysis.
Data lineage and data governance
Without a way to identify where data errors are introduced into the environment, data stewards will find it difficult to identify and fix data quality issues. When data flaws continue to propagate, the organization can be plagued by inconsistent reporting and analyses that lead to bad decisions.
Data lineage tools can simplify the root cause analysis process by providing visibility into the sequence of processing stages through which the data flows. The quality of the data can be examined at each point of the processing flow, enabling IT to find the point of introduction of data errors.
Working backward from where the error was identified, the data steward can insert controls at each point to check whether the data conformed to the defined expectations or whether the error was already present. The processing stage at which the data was compliant upon entry but flawed upon exit is the point at which the data error was introduced. Pinpointing this location enables the data steward to focus on eliminating the root cause instead of just trying to correct the bad data.
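This root cause search can be sketched in a few lines. The following is a minimal illustration, not a real tool: the pipeline stages, the faulty transformation and the quality rule (a non-empty postal code) are all assumptions made for the example. The search returns the first stage whose input satisfied the rule but whose output did not.

```python
# Hypothetical sketch: locate the stage that introduced a data error by
# checking a quality rule at each stage boundary along the processing flow.
# Stage names, transforms and the validity rule are illustrative assumptions.

def is_valid(record):
    # Example rule: a customer record must carry a non-empty postal code.
    return bool(record.get("postal_code"))

def find_root_cause(stages, record):
    """Return the first stage whose input passed the rule but whose
    output failed it -- the point where the error was introduced."""
    for stage in stages:
        passed_on_entry = is_valid(record)
        record = stage["transform"](record)
        passed_on_exit = is_valid(record)
        if passed_on_entry and not passed_on_exit:
            return stage["name"]
    return None  # error was present at the source, or never appeared

stages = [
    {"name": "extract",   "transform": lambda r: dict(r)},
    {"name": "cleanse",   "transform": lambda r: {**r, "postal_code": ""}},  # faulty stage
    {"name": "aggregate", "transform": lambda r: dict(r)},
]

print(find_root_cause(stages, {"postal_code": "02139"}))  # -> cleanse
```

In practice, the "controls" would be data quality checks wired into the pipeline rather than an in-memory loop, but the logic of comparing entry and exit state at each stage is the same.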
A data lineage tool can also help data stewards stay on top of unexpected data source format and structural changes in today's environments, which are much more dynamic than they were in the past. When data sources change, there may be unintended consequences downstream.
By working forward from the point of data acquisition, the data steward can rely on data lineage to help trace the data dependencies and determine the processing stages that have been affected by the change in the data. This can enable the data steward to consider how to re-engineer the processing stage to accommodate the identified change.
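Impact analysis amounts to a forward traversal of the lineage graph. The sketch below assumes lineage is available as a simple producer-to-consumer mapping; the node names are invented for illustration. A breadth-first walk from the changed source collects every downstream stage that needs review.

```python
from collections import deque

# Hypothetical sketch of impact analysis: given lineage edges
# (producer -> consumers), find every downstream stage affected by a
# change to one source. Node names are illustrative assumptions.

lineage = {
    "crm_feed":          ["staging.customers"],
    "staging.customers": ["dw.dim_customer"],
    "dw.dim_customer":   ["report.churn", "report.revenue"],
}

def downstream_impact(graph, changed_source):
    """Breadth-first walk from the changed source to all consumers."""
    affected, queue = set(), deque([changed_source])
    while queue:
        node = queue.popleft()
        for consumer in graph.get(node, []):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected

print(sorted(downstream_impact(lineage, "crm_feed")))
# -> ['dw.dim_customer', 'report.churn', 'report.revenue', 'staging.customers']
```

The visited set doubles as the review list: each node in it is a processing stage the data steward may need to re-engineer to accommodate the source change.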
What to look for in data lineage tools
Manually collecting metadata and documenting data lineage involves a significant resource investment, and it is prone to error, especially in organizations that rely on reports and analyses to drive operations. Therefore, it can be helpful to seek out data lineage tools and technologies that not only manage the representations of the data lineage, but can also automatically map them across the enterprise.
Look for products that can:
- Natively access a broad array of data sources and intermediate data products and survey the inherent metadata.
- Aggregate the captured metadata into a centralized repository.
- Infer data element data types and match uses of reference data to the data elements from across different sources that use that reference data.
- Provide simplified presentations of the metadata to a variety of users in the organization and encourage collaborative participation to help validate the metadata descriptions.
- Document the end-to-end mapping of how data flows throughout the processing streams.
- Provide a visual presentation of the data lineage.
- Provide APIs for developers to implement applications that query the lineage.
- Create an inverted index to map data element names to their uses in different processing stages.
- Provide a search capability to rapidly trace the flow of data from its origination point to all its downstream targets.
- Provide a trace capability to traverse data flows either forward or backward.
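Two of these capabilities, the inverted index and the bidirectional trace, fit together naturally. The sketch below is a toy model under assumed stage and element names, not any vendor's API: it builds an index from data element names to the stages that use them, and walks the flow either forward (to consumers) or backward (to producers) from a given stage.

```python
# Hypothetical sketch of an inverted index over data elements plus a
# forward/backward trace over the lineage. Stage and element names are
# illustrative assumptions.

stages = {
    "load_orders": {"elements": ["order_id", "cust_id"], "feeds": ["join_cust"]},
    "load_custs":  {"elements": ["cust_id", "region"],   "feeds": ["join_cust"]},
    "join_cust":   {"elements": ["order_id", "region"],  "feeds": ["report"]},
    "report":      {"elements": ["region"],              "feeds": []},
}

# Inverted index: data element name -> stages in which it appears.
index = {}
for name, stage in stages.items():
    for element in stage["elements"]:
        index.setdefault(element, []).append(name)

def trace(start, direction="forward"):
    """Walk the flow from a stage: forward to consumers, backward to producers."""
    if direction == "forward":
        edges = {s: list(stages[s]["feeds"]) for s in stages}
    else:  # invert the feeds relation
        edges = {s: [] for s in stages}
        for s in stages:
            for target in stages[s]["feeds"]:
                edges[target].append(s)
    seen, frontier = [], [start]
    while frontier:
        node = frontier.pop()
        for nxt in edges[node]:
            if nxt not in seen:
                seen.append(nxt)
                frontier.append(nxt)
    return seen

print(sorted(index["cust_id"]))            # stages that use cust_id
print(sorted(trace("report", "backward"))) # every upstream stage of the report
```

Commercial tools layer search, visualization and APIs over essentially this structure; the index answers "where is this element used?" while the trace answers "what feeds, or is fed by, this stage?"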