There are many reasons why you might need to understand the history -- or lineage -- of a piece of data. If you see a sales figure in a report, you want to know where that figure came from, which system or systems stored it and when it was produced.

Just as your passport has stamps that indicate which countries you have visited, a data element has a journey through the systems of a company. From the point where it was first entered to each system it was later copied to, every stage of the journey has a timestamp, much as the stamps in your passport show the progress of your travels around the world. This ability to trace the path of data through an enterprise is called data lineage.
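The passport analogy can be sketched as a small data structure. This is only an illustration, not how any particular lineage tool stores its records; the class names, the field names and the sample systems (`order_entry`, `warehouse`, `report`) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class LineageStamp:
    """One 'passport stamp': a system the data passed through, and when."""
    system: str
    action: str  # e.g. "entered" or "copied"
    at: datetime


@dataclass
class DataElement:
    """A piece of data together with the stamps it has collected."""
    name: str
    stamps: list = field(default_factory=list)

    def stamp(self, system, action, at):
        self.stamps.append(LineageStamp(system, action, at))

    def journey(self):
        """Return the systems visited, ordered by timestamp."""
        return [s.system for s in sorted(self.stamps, key=lambda s: s.at)]


# Hypothetical sales figure; stamps are recorded out of order on purpose.
sales = DataElement("q3_sales_total")
sales.stamp("warehouse", "copied", datetime(2024, 10, 2, 1, 0))
sales.stamp("order_entry", "entered", datetime(2024, 10, 1, 9, 30))
sales.stamp("report", "copied", datetime(2024, 10, 3, 8, 0))

print(sales.journey())
```

Sorting by timestamp rather than by insertion order means the journey reads correctly even when stamps arrive from different systems at different times.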

Data may go through one or more business processes and have controls applied to it at different stages, such as data quality validation -- e.g., verifying a postcode or checking that a value is within a valid range. Think of data lineage documentation as a kind of treasure map for your data that shows the passage of data through your systems.
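The two controls mentioned above, a postcode check and a range check, might look something like the following sketch. The postcode pattern is a deliberately simplified UK-style format, and the record fields and the 1 to 1000 quantity limits are assumptions chosen for illustration. Real validation rules would come from the business.

```python
import re

# Simplified UK-style postcode pattern; an assumption for illustration only.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")


def check_postcode(value):
    """Return True if the value looks like a postcode under the simplified pattern."""
    return bool(POSTCODE_RE.match(value.strip().upper()))


def check_range(value, low, high):
    """Return True if the value lies within the inclusive range [low, high]."""
    return low <= value <= high


def apply_controls(record):
    """Run each control and return a log of (control name, passed) pairs."""
    return [
        ("postcode_format", check_postcode(record["postcode"])),
        ("quantity_range", check_range(record["quantity"], 1, 1000)),
    ]


# Hypothetical records: one that passes both controls and one that fails both.
good = {"postcode": "SW1A 1AA", "quantity": 12}
bad = {"postcode": "12345", "quantity": 0}

print(apply_controls(good))
print(apply_controls(bad))
```

Returning a log of which controls ran, and whether they passed, is what lets the controls themselves become part of the lineage record.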

Data lineage tools

How does an enterprise provide such a data audit trail, or map of data's journey? It is possible to draw diagrams to illustrate the flow of data from system to system, but doing so manually would be impractical at scale. Fortunately, there is no shortage of data lineage tools to help. There are many metadata repositories from vendors such as Collibra, Alation, Infogix, Erwin and others. There are also data lineage tools from Octopai and Talend, the latter of which has offered open source editions of its software. These tools vary, but they all provide at least some degree of assistance with tracing data lineage. Software such as this can automatically scan database catalogs and, in some cases, even program code in order to produce dependency diagrams and visually show the data lifecycle. In one case study, a large American bank estimated that its compliance project would have taken 80 times more effort had it been done without the use of automated tools.
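To give a rough sense of what tracing dependencies from a catalog involves, here is a minimal sketch. It does not reflect the internals of any of the products named above. The `CATALOG` dictionary is a hypothetical extract in which each table is mapped to the tables it is loaded from, and the table names are invented.

```python
# Hypothetical catalog extract: each table maps to the tables it is loaded from.
CATALOG = {
    "sales_report": ["sales_mart"],
    "sales_mart": ["orders_staging", "customers_staging"],
    "orders_staging": ["order_entry"],
    "customers_staging": ["crm"],
    "order_entry": [],
    "crm": [],
}


def upstream(table, catalog):
    """Return every table that feeds `table`, directly or indirectly."""
    seen = set()
    stack = list(catalog.get(table, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        stack.extend(catalog.get(current, []))
    return sorted(seen)


print(upstream("sales_report", CATALOG))
```

A real tool would populate the mapping by reading database metadata and parsing load scripts rather than from a hand-written dictionary, but the traversal idea is the same: start at the report and walk backward to the sources.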

Who is responsible for data lineage?

Deciding who is responsible for data lineage is important. Ideally, data lineage documentation would sit within the remit of the data governance team. Data governance bodies define the ownership of data in an enterprise, usually through a steering group, a small central team to coordinate things and a network of data stewards embedded within the business lines. Those on the data governance team are responsible for determining the golden copy versions of key data, such as customer or product hierarchies, and for the quality of data, so adding data lineage documentation to these responsibilities seems like a logical extension. Otherwise, responsibility may end up in the data management function of the IT department, which may lack the business knowledge to prioritize things.