Whose responsibility is it to ensure data quality? Despite the desire for a simple solution, the answer to that question remains complex.
Even in a constrained environment with only one method of inputting data and one channel for viewing that data, you could propose four alternatives. First, make the data creator or supplier responsible for ensuring data inputs adhere to an organization's data quality rules before committing them to the target system. Second, make the system owner responsible for ensuring data meets the quality rules once it has been loaded into the system. Third, make the system owner responsible for ensuring the rules are observed before information based on the data is presented to end users. Or fourth, make the users responsible for ensuring data complies with the rules before using the presented information.
Attempting to distinguish the qualitative differences between those choices forces you to consider a slightly different question: Whose perception of data quality requirements is being adhered to? Ultimately, you'd like to believe the primary intent of the data quality process is to benefit the users of the data -- so the simplistic response would be that the data consumers should "own" the data quality rules. Even so, it would be beneficial to apply those rules as far back in the information flow as possible -- ideally, all the way back to the data creator.
Data quality picture not always clear
But while that might be optimal, the reality is somewhat murky for three reasons. First, the data originators may have created the data for one purpose, but the rules governing data quality for that original purpose may not necessarily be aligned with the downstream ones at the end-user level. Second, from an administrative standpoint, data producers may not have the resources -- due to both budget and time limits -- to address the data quality issues of users. And third, in many cases, transforming the data collected in a corporate system of record isn't feasible.
Also, recall that we're looking at a constrained model: one source of input and one channel for output. Imagine that constraint being lifted and the number of data providers increasing by a factor of 10 or more, and the number of data users ballooning as well.
In this more realistic scenario, different groups of users often are going to have different sets of data quality rules. If we review the first two of our four alternatives for assigning data quality responsibilities, we face two potential problems: scalability and conflicts. Validation of data on entry into systems now becomes a matter of validating it against all the downstream quality rules, raising the question of whether that can realistically be done while still meeting service-level agreements on processing performance. More seriously, the possibility of internal conflicts arises when different rules are inconsistent with one another.
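To make the conflict problem concrete, here is a minimal sketch in Python. The two user groups, their rules and the field names (phone, opt_out) are hypothetical, invented only for illustration. It checks a record at the point of entry against every downstream group's rules and reports which groups it fails.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Rule = Callable[[Record], bool]

# Hypothetical rules owned by two downstream user groups.
def marketing_phone_rule(record: Record) -> bool:
    # Marketing wants a phone number on every customer record.
    return bool(record.get("phone"))

def compliance_phone_rule(record: Record) -> bool:
    # Compliance wants phone numbers withheld for opted-out customers.
    return not (record.get("opt_out") == "yes" and record.get("phone"))

RULES_BY_GROUP: Dict[str, List[Rule]] = {
    "marketing": [marketing_phone_rule],
    "compliance": [compliance_phone_rule],
}

def failing_groups(record: Record) -> List[str]:
    """Return the groups whose rules the record violates at entry time."""
    return [
        group
        for group, rules in RULES_BY_GROUP.items()
        if not all(rule(record) for rule in rules)
    ]
```

Under these assumed rules, an opted-out customer record fails the compliance check if it carries a phone number and fails the marketing check if it doesn't, so no single version of the record can pass validation for both groups on entry. Every additional group also adds rules to evaluate per record, which is where the scalability concern comes from.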
Unless rigid enterprise data governance policies are in place, implementing data quality assurance processes at the point of data entry may be both a technical and an operational challenge. That leaves us with the third and fourth alternatives, both of which involve applying rules on data quality when data is used, with either the system owner or the users taking responsibility for applying them.
Put the data quality controls in users' hands
If a department or business unit controls the rules, it seems unwieldy to "outsource" enforcement to another party. The only logical conclusion is that users should enforce their rules at the point of consumption. In other words, because data quality depends on the context in which the data is used, this is a case of beauty (or quality) being in the eye of the beholder (or user).
That doesn't free IT data practitioners from having a role in data quality management and assurance. But the particulars of their role must be adjusted to facilitate the development of a framework for defining and implementing a data quality policy that's specific to each group of users. And data virtualization provides one possible means of balancing the need for unified access to data with customized application of quality rules.
Using data virtualization software, data management professionals can create semantic layers customized to the needs of different groups of users on top of a foundational layer supporting federated access to the underlying systems. Data quality validation, and data transformation and standardization, would then be sandwiched between the two layers. This approach allows the same data to be prepared in different ways that are suited to individual usage models, while retaining a historical record of the application of data quality policies -- and, therefore, traceability and auditability.
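The layering described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the API of any data virtualization product: federated_rows, SemanticView, the sample records and both groups' rules are all hypothetical names made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]

def federated_rows() -> List[Record]:
    # Stand-in for the foundational layer; a real deployment would query
    # the underlying systems through the virtualization software.
    return [
        {"id": "1", "country": "us", "email": "a@example.com"},
        {"id": "2", "country": "DE", "email": ""},
    ]

@dataclass
class SemanticView:
    """One user group's view, with that group's own quality rules."""
    group: str
    validate: Callable[[Record], bool]
    standardize: Callable[[Record], Record]
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def rows(self) -> List[Record]:
        result = []
        for row in federated_rows():
            # Validation sits between the shared layer and the view.
            if not self.validate(row):
                self.audit_log.append((self.group, row["id"], "rejected"))
                continue
            self.audit_log.append((self.group, row["id"], "accepted"))
            # Standardize a copy so the shared data is never altered.
            result.append(self.standardize(dict(row)))
        return result

# Hypothetical groups with different rules over the same source data.
marketing = SemanticView(
    group="marketing",
    validate=lambda r: bool(r["email"]),
    standardize=lambda r: {**r, "country": r["country"].upper()},
)
finance = SemanticView(
    group="finance",
    validate=lambda r: len(r["country"]) == 2,
    standardize=lambda r: r,
)
```

In this sketch, marketing sees only the record with an email address, with its country code uppercased, while finance sees both records unchanged. Each view's audit_log keeps a record of which rules were applied to which records, which is the kind of history that traceability and auditability would rest on.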
As the numerous data sources in the typical corporate enterprise become increasingly critical to business success, the complexity of data quality assurance also has the potential to increase significantly. But by pushing stewardship of data quality rules to the user level and deploying data virtualization tools to help manage the process, you'll have a good chance of ensuring suitable data is available for use and unnecessary conflicts over quality issues don't bog your organization down in acrimonious infighting.
About the author:
David Loshin is president of Knowledge Integrity Inc., a consulting and development services company that works with clients on big data, business intelligence and data management projects. He also is the author or co-author of various books, including Using Information to Develop a Culture of Customer Centricity. Email him at [email protected].