Three data-related processes are getting increased attention in many organizations: big data management, Agile development and data governance. Unfortunately, they're often at loggerheads with one another.
In such cases, the desire for rapid application development diminishes the perceived need for upfront data governance; application developers figure it can be folded in later. Yet big data environments are diverse and fast-changing, absorbing information from a variety of data sources and streams, and they demand some type of governance controls to ensure consistency in downstream analytical results.
The upshot is that enterprises must adapt their data governance initiatives to align with a combination of increased data volumes and faster development cycles. One adaptation that's worth considering is leveraging an ontological approach to tracking information use and modeling data flows that support the Agile methodology, thereby enabling data governance mechanisms to be incorporated into big data applications from the start.
The approach centers on data lineage, which has two key aspects. One is structural, providing a mapping of entity and data element concepts to instantiated database tables (or other entity instances, in the case of unstructured data), and to physical data elements as they're used across the application landscape. The second involves data flow -- how data moves from the creation or acquisition point to various points of persistence in systems, where it's used to generate results in both operational and analytical applications.
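To make the two aspects concrete, here is a minimal sketch of how a lineage record might hold both of them. The class names, method names and sample data sets are hypothetical, chosen only for illustration; a real lineage repository would track far more detail.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PhysicalElement:
    """One physical data element, e.g. a column in a table or a field in a file."""
    dataset: str
    name: str


@dataclass
class LineageModel:
    # Structural aspect: concept name -> physical elements that instantiate it.
    concept_map: dict = field(default_factory=dict)
    # Data flow aspect: directed (source, target) pairs between physical elements.
    flows: set = field(default_factory=set)

    def map_concept(self, concept: str, element: PhysicalElement) -> None:
        """Record that a physical element instantiates a data concept."""
        self.concept_map.setdefault(concept, set()).add(element)

    def add_flow(self, source: PhysicalElement, target: PhysicalElement) -> None:
        """Record that data moves from one physical element to another."""
        self.flows.add((source, target))

    def instances_of(self, concept: str) -> set:
        """Return every physical element mapped to the given concept."""
        return set(self.concept_map.get(concept, set()))

    def direct_targets(self, element: PhysicalElement) -> set:
        """Return the elements that receive data directly from the given one."""
        return {target for source, target in self.flows if source == element}


# Hypothetical example: one customer identifier concept, two physical forms.
model = LineageModel()
stream_id = PhysicalElement("clickstream_events", "cust_id")
warehouse_id = PhysicalElement("warehouse.customers", "customer_id")
model.map_concept("Customer.identifier", stream_id)
model.map_concept("Customer.identifier", warehouse_id)
model.add_flow(stream_id, warehouse_id)
```

In this sketch, `instances_of` answers the structural question (where does a concept physically live?) and `direct_targets` answers the flow question (where does this element's data go next?).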
Getting a handle on data lineage is crucial to operationalizing data governance, especially in Hadoop data lakes and other big data systems. In essence, proper data lineage practices will provide a map of an organization's data sets, along with semantic information about the data elements that are in those data sets. By applying governance controls to the documented data flows, internal standards can be enforced regarding data usage.
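As a rough illustration of applying a governance control to documented data flows, the sketch below flags any flow that moves data from a restricted source into a data set that hasn't been approved to receive it. The rule itself, the function name and the data set names are assumptions made for the example, not a prescribed policy.

```python
def find_flow_violations(flows, restricted_sources, approved_targets):
    """Return documented flows that move restricted data into an unapproved data set.

    Only direct flows are checked; data that reaches a target through
    intermediate hops is not traced in this simplified version.
    """
    return [
        (source, target)
        for source, target in flows
        if source in restricted_sources and target not in approved_targets
    ]


# Hypothetical documented flows, as (source data set, target data set) pairs.
documented_flows = [
    ("crm.customers", "lake.customers_raw"),
    ("crm.customers", "marketing.export"),
    ("web.page_views", "marketing.export"),
]

violations = find_flow_violations(
    documented_flows,
    restricted_sources={"crm.customers"},
    approved_targets={"lake.customers_raw"},
)
```

A check like this only works if the flows are documented in the first place, which is the point of maintaining lineage alongside the applications.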
In addition, data lineage management addresses some of the key issues that characterize the struggle between development agility and effective data governance initiatives, including the following:
Data descriptions. A metadata repository is designed to document the structure of data artifacts, but metadata management tools are typically limited to describing a static view of persisted data sets.
On the other hand, data lineage is intended to give users visibility into how data structures represent real-world information concepts, including the ways that the interpretation and representation of those concepts change as data flows through a system architecture. It provides a holistic description of the progression from data model to physical instantiation over the entire data lifecycle.
Data dependencies. The ability to document and track data lineage yields deeper visibility into the ways that applications and data sets depend on each other. Seeing that a data element collected via a data stream provides the same entity attribution as one ingested as part of a bulk data load establishes congruity among different data sets, which then justifies attempts to integrate and reuse data with the appropriate governance rules applied.
Semantic consistency. Conversely, data lineage coupled with metadata management helps to identify similarly named data elements that represent different concepts, which is a signal that they shouldn't be conflated through a data integration process.
The combination also provides the context for data architects to see where related data elements aren't aligned from a semantic perspective, as well as guidance on eliminating such inconsistencies.
Impact analysis. By virtue of the mapping from data concept to actual instantiation, data lineage can indicate which applications and databases will be affected when a change is made to a data definition or specification. This is one way that data governance initiatives can contribute to increased development agility: by helping to simplify the planning process for modifying applications.
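Two of the items above, semantic consistency and impact analysis, lend themselves to a short sketch. The first function flags element names that are mapped to more than one concept (using exact, case-insensitive name matching as a simplification of "similarly named"); the second walks documented dependencies breadth-first to find everything downstream of a changed definition. All function names, concept names and data set names here are hypothetical.

```python
from collections import defaultdict, deque


def conflicting_names(element_concepts):
    """Map each element name to its concepts, keeping only names with more than one.

    element_concepts: {(dataset, element_name): concept}
    """
    concepts_by_name = defaultdict(set)
    for (_dataset, name), concept in element_concepts.items():
        concepts_by_name[name.lower()].add(concept)
    return {n: c for n, c in concepts_by_name.items() if len(c) > 1}


def impacted_by(changed, edges):
    """Return every node reachable downstream of `changed`.

    edges: iterable of (upstream, downstream) pairs taken from the lineage record.
    """
    downstream = defaultdict(list)
    for src, dst in edges:
        downstream[src].append(dst)
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in downstream[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(changed)  # a cycle should not report the change as its own impact
    return seen


# Hypothetical metadata: two "status" columns that mean different things.
mappings = {
    ("crm.customers", "status"): "Customer.account_status",
    ("orders.shipments", "status"): "Shipment.delivery_status",
    ("crm.customers", "zip"): "Customer.postal_code",
    ("billing.accounts", "zip"): "Customer.postal_code",
}
conflicts = conflicting_names(mappings)

# Hypothetical dependencies, from data concept down to consuming applications.
edges = [
    ("Customer.postal_code", "crm.customers.zip"),
    ("crm.customers.zip", "lake.customers_raw.zip"),
    ("lake.customers_raw.zip", "churn_model"),
    ("crm.customers.zip", "billing_app"),
    ("Product.sku", "inventory.items.sku"),
]
affected = impacted_by("Customer.postal_code", edges)
```

Here the two `zip` columns share a concept and so are candidates for integration, while the two `status` columns are flagged; a change to the postal code definition touches the CRM table, the data lake copy and both consuming applications, but not the inventory data.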
Although data lineage management is increasingly recognized as a component of well-designed data governance programs, the jury is still out on the best implementation method. That's due to three challenges facing IT teams: the need for an appropriate representational model, the need for defined procedures to follow when designing and developing systems in order to update the data lineage repository, and the need to define how the data lineage repository will be used in practice.
We'll examine some different approaches to designing a data lineage model in future articles.