How data lineage tools boost data governance policies

Organizations can bolster data governance efforts by tracking the lineage of data in their systems. Get advice on how to do so and key features in data lineage tools.

The essence of data governance is creating corporate data policies and ensuring that people comply with them. Such policies can span an array of intents, including directives on data protection, validation and usage. Data stewards and data governance managers must solicit data requirements from business users and work with data governance council members to agree on common data definitions, specify data quality metrics, articulate associated policies and develop approaches to measure compliance.

It's a big challenge, though, to bridge the gap between defining data governance policies and implementing them. The policies are intended to assert control and oversight over the quality of data assets across business workflows, yet data stewards are often tasked with critical data quality management responsibilities without being given the proper training or technology.

That's where data lineage tools come in. Metadata-based data lineage information documents the journey that data objects take through an organization's systems. Lineage records can help data analysts and other end users understand the data they work with, but they also simplify two critical data governance procedures: analyzing the root causes of data quality issues and assessing the impact of changes to data sets in source systems.

Data lineage and data governance

Without a way to determine where data errors are introduced into a data management environment, data stewards and data quality analysts will find it difficult to identify and fix them. That has consequences: If data flaws continue to propagate in systems, the organization may be plagued by inconsistent or inaccurate analytics and reporting that lead to bad decision-making in business operations.

In the root cause analysis process, data lineage tools provide visibility into the sequence of processing stages through which the data that's being checked flows. The quality of the data can be examined at each stage, enabling data governance and data quality teams to find the points where data errors originate.

Working backward from where an error is first identified, a data steward can insert controls at earlier points to monitor whether the data conformed to the defined expectations then or included the error. By pinpointing the processing stage at which the data was compliant upon entry but flawed upon exit, the data steward and other workers involved in a data governance program can focus on eliminating the root cause instead of just correcting the bad data.
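The backward walk described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `StageSnapshot` record, the stage names and the `non_negative_amount` check are all hypothetical, and the sketch assumes that the input and output of each processing stage were captured so the same data quality rule can be applied at every boundary.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StageSnapshot:
    """Hypothetical capture of one record before and after a processing stage."""
    stage: str
    data_in: dict
    data_out: dict


def find_root_cause_stage(
    snapshots: List[StageSnapshot],
    is_valid: Callable[[dict], bool],
) -> Optional[str]:
    """Walk backward from the last stage and return the stage whose input
    passed the check but whose output failed it, or None if no stage did."""
    for snap in reversed(snapshots):
        if is_valid(snap.data_in) and not is_valid(snap.data_out):
            return snap.stage
    return None


def non_negative_amount(record: dict) -> bool:
    """Assumed data quality rule: amounts must not be negative."""
    return record["amount"] >= 0


# Snapshots listed in lineage order (illustrative data only).
snapshots = [
    StageSnapshot("extract", {"amount": 120.0}, {"amount": 120.0}),
    StageSnapshot("convert_currency", {"amount": 120.0}, {"amount": -120.0}),
    StageSnapshot("load", {"amount": -120.0}, {"amount": -120.0}),
]

faulty_stage = find_root_cause_stage(snapshots, non_negative_amount)
```

Here `faulty_stage` is `"convert_currency"`: the record was compliant on entry to that stage and flawed on exit, so that is where a fix should be targeted rather than at the final load step where the error was noticed.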

Data lineage tools can also help data stewards perform impact analysis to stay on top of issues caused by changes to source data formats and structures in data management environments, which are often much more dynamic now than they were in the past.

When source data changes, there may be unintended consequences downstream. By working forward from the point of data creation or collection, a data steward can rely on data lineage documentation to help trace data dependencies and determine the processing stages that are affected by the changes in the data. This enables the data governance and data management teams to reengineer the affected stages to accommodate the changes and ensure that data remains consistent in different systems.
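As a rough illustration of working forward through lineage records, the following Python sketch treats lineage as a directed graph and lists everything downstream of a changed source. The `LINEAGE` mapping and its data set names are invented for the example; a real lineage tool would supply these dependencies from its metadata repository.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage graph: each key feeds the data sets listed as its value.
LINEAGE: Dict[str, List[str]] = {
    "crm.customers": ["staging.customers"],
    "staging.customers": ["warehouse.dim_customer"],
    "warehouse.dim_customer": ["reports.churn", "reports.revenue"],
    "erp.orders": ["warehouse.fact_orders"],
    "warehouse.fact_orders": ["reports.revenue"],
}


def downstream_of(source: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Return every data set reachable from `source`, in breadth-first order."""
    seen = {source}  # guards against cycles and repeat visits
    affected: List[str] = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


affected_by_crm_change = downstream_of("crm.customers", LINEAGE)
```

In this example a change to `crm.customers` affects the staging table, the customer dimension and both reports, while a change to `erp.orders` would reach only the orders fact table and the revenue report. That list is the set of stages a team would review and possibly reengineer.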

What to look for in data lineage tools

Manually collecting metadata and documenting data lineage requires a significant resource investment. It's also prone to error, which can cause big problems, especially in organizations that rely on data analytics to drive business operations. As a result, data governance efforts benefit from tools that manage the representations of your data's lineage and automatically map them across the enterprise.

During the technology evaluation process, you should look for data lineage tools that:

  • Natively access a broad array of data sources and data products, survey the metadata they contain and collect it for data governance uses.
  • Aggregate the captured metadata into a centralized repository.
  • Infer data types and match common uses of reference data to data elements from different systems.
  • Provide simplified presentations of the aggregated metadata to a variety of end users and support collaborative efforts to validate the metadata descriptions.
  • Document the end-to-end mapping of how data flows through your organization's processing streams.
  • Generate visualized representations of data lineage.
  • Include APIs for developers to use in building applications that can query the lineage records.
  • Create an inverted index to map data element names to their uses in different processing stages.
  • Offer a search capability to rapidly trace the flow of data from its origination point to its downstream targets.
  • Enable users to monitor data flows both forward and backward.
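To make the inverted index item in the list above more concrete, here is a minimal Python sketch. The stage names and data element names in `STAGE_ELEMENTS` are hypothetical, and commercial tools will build and store such an index in their own ways; the point is only to show the shape of the mapping from a data element name to the processing stages that use it.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical lineage records: (processing stage, data elements it reads or writes).
STAGE_ELEMENTS: List[Tuple[str, List[str]]] = [
    ("ingest_orders", ["order_id", "customer_id", "amount"]),
    ("enrich_customers", ["customer_id", "region"]),
    ("build_revenue_report", ["amount", "region"]),
]


def build_inverted_index(
    stage_elements: List[Tuple[str, List[str]]],
) -> Dict[str, List[str]]:
    """Map each data element name (lowercased) to the stages that use it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for stage, elements in stage_elements:
        for element in elements:
            key = element.lower()
            if stage not in index[key]:  # avoid listing a stage twice
                index[key].append(stage)
    return dict(index)


element_index = build_inverted_index(STAGE_ELEMENTS)
stages_using_customer_id = element_index.get("customer_id", [])
```

With this structure, a lookup such as `element_index.get("customer_id", [])` answers the question "where is this element used?" directly, instead of scanning the description of every stage.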

Data lineage technology options

There's a plethora of product options to consider. Tools for documenting and managing data lineage are part of the data management platforms sold by large IT vendors, including IBM, Informatica, Information Builders, Oracle, SAP and SAS Institute. They're also offered by smaller software vendors that focus on data integration, quality and governance, such as Adaptive, ASG Technologies, Collibra, Erwin, Infogix and Talend, and by metadata management specialists like Alex Solutions and Octopai.

In addition, data lineage capabilities are built into the data catalog software developed by companies that include Alation and Waterline Data. For data engineers and analytics teams, DataRobot's Paxata unit, Trifacta and other data preparation vendors also incorporate data lineage functions into their products, as do various vendors of BI and analytics tools.

Join the conversation

What do you see as the biggest challenges in documenting and managing data lineage information?
One of the biggest challenges in my practice was buying into a vendor pitch to automate everything, only to find out later that nothing is "automagic" and that an unforeseen manual effort by a team of SMEs is required to "prime" the system. It's like buying a smartphone and trying to send an SMS to your contacts without loading them first.
Totally agree. Complete automation is a must, otherwise it's adding more to our already full plates! Octopai is the only vendor we've seen that offers complete automated data lineage + discovery + business glossary without any professional services or installation required. We were using the product for impact analysis within an hour or so.
I think maintaining the security of the data is the challenge that needs to be taken care of. With the advancement of big data tools, this challenge can be looked after.
Automation is the key, however just the Metadata is not where the secret sauce lies. This is an important step yes, but true content indexing and classification will be the key driver to move and act on the data lineage. Time, User, File type, Creator, size, all good intel, but to truly evaluate an action you must understand the content as well. Aparavi Platform will collect all metadata of every file and maintain a content index with global classification. Automatic collection, classification, smart policies will allow actions to be made on the value or classification element and Cut/Copy/Delete files based on the intelligence. Also the Platform is cloud agnostic and can use any S3 compatible on prem storage as well. There is value in the metadata for sure, but don't forget about the contents, as this is where most of the risk is.