The conventional wisdom in data management holds that the critical dimensions of data quality include accuracy, completeness, timeliness and consistency. But what does data consistency really mean? What are its implications for data management teams? And what kind of data integration techniques can you use to operationalize it across multiple applications?
The need for data consistency is predicated on the condition that the same source data sets are often used in multiple ways for different BI and analytics applications. For example, a set of daily sales transactions could be used for operational reporting to compare sales of similar products across different retail locations and for advanced analytics applications, such as developing customer profiles based on the types of products that people have purchased.
The same transaction data is used for both types of applications, yet the processes that are used to extract, transform, reorganize and then transmit the data to the downstream applications pose risks of introducing at least four different forms of data inconsistency.
Ways to go wrong on data integration
There's temporal inconsistency, which can result from extracting data from a source system at different times; doing so means modifications made to the data after one extraction will be included in subsequent ones.
Structural inconsistency involves data values that get put into different formats during the data integration process -- for instance, when ZIP codes are mistakenly inferred to be another data type and those that begin with a zero are transformed into four-digit numbers.
Data lakes add to challenges.
Semantic inconsistency happens when the target applications interpret what is meant by the data differently. The use of the term customer, for example, can have different definitions depending on whether data originates in marketing, sales or customer support systems.
Transformation inconsistency occurs when data integration jobs apply different transformations to the same source data -- for example, based on separate business rules that conditionally alter attribute values.
In many cases, an absence of coordination among application and data integration developers allows such inconsistencies to creep into the target applications. That's a symptom of ungoverned data management practices: Siloed business operations develop their data integration processes in a virtual vacuum, with infrequent communication across functional lines.
Therefore, the first step is to document the existing data integration techniques and practices used throughout your organization. Identify what integration jobs are accessing data sources, under what circumstances and at what times. Doing so will enable data stewards to examine the various integration streams to determine if there are any obvious conflicts that might lead to data inconsistencies.
Take control of the integration process
As a second step, ensuring data consistency across applications requires controls to prevent the introduction of inconsistencies. That calls for instituting policies and procedures to combat the effects of uncoordinated data integration sequences. Here are some ideas on how to use new integration techniques to accomplish that and avoid each of the four types of data inconsistencies outlined earlier:
- Synchronize the data integration streams. If a root cause of temporal inconsistency is asynchronous data extraction from operational or transaction systems, create a synchronized trigger for starting the different integration streams. This approach can accommodate scheduled extractions and streaming data, the latter by caching data and releasing it to all the integration jobs at the same time.
- Standardize formats for data values. Specifying storage and presentation standards for data formats is intended to reduce differences introduced by application developers working independently. Data practitioners can develop enterprise-wide services that validate data values against defined format standards, thereby reducing structural inconsistency.
- Harmonize data definitions. Variations in data definitions -- or worse, the absence of them -- enable data users to presume they can impose their own meanings on data sets. Internal collaboration to identify, review and harmonize different data definitions will help reduce semantic inconsistency.
- Coordinate data transformations. Many data integration jobs are driven by defined business rules. A review of different sets of business rules by data stewards can help identify situations in which inconsistent transformations might be applied. These can either be harmonized so the results remain consistent or qualified so the outputs are specified independently to recognize data inconsistencies.
Ultimately, combining data standards, data validation, good data stewardship practices and well-planned data integration techniques will introduce much-needed discipline -- and consistency -- to your integration processes.