This article originally appeared on the BeyeNETWORK.
I recently read yet another article about data quality. It promoted extracting data from source databases, verifying and correcting it in a staging area, transforming it, “certifying” it (as valid according to a set of business rules) and publishing it in a reporting database. This was called “the data quality process.” Data cleansing is not information quality.
The problem is that this is neither information quality nor data quality. The author was simply describing “automated information scrap and rework.” The supposed “data quality process” fails to meet the criteria of a sound quality management system.
Here are the fundamental flaws in this as a “data quality process”:
It attacks the symptom instead of the root cause. Because no attempt is made to correct the data at the source (corrective maintenance), corrections made in the downstream reporting database do nothing about the defective records in the source. Those records remain there and continue to cause the processes that use them to fail. Furthermore, the processes producing the defective data will continue to produce defective data.
No automated process can identify and correct all errors. Automated data correction techniques can only change data to conform to business rules or to match values in a reference dataset that is deemed correct, such as a postal address file. But those reference datasets can themselves contain errors. And although automated correction can change a value to a “valid” value, that value is not necessarily “correct.” Assigning a gender of “male” and a personal title of “Mr.” to George A. Davis would actually introduce an error, alienating her and possibly losing her as a customer.
By focusing on data correction instead of process improvement, all money spent on the back end is suboptimized. Without a proper process improvement initiative on the source processes, errors will continue at the same rate. They might even increase if information producers know there is a back-end process for correcting them.
Every valid quality management system has changed from an approach of “inspect and correct” (scrap and rework) to “defect prevention” (process improvement) to prevent the causes of defects.
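The distinction drawn above between a “valid” value and a “correct” one can be illustrated with a small sketch. Everything in it is hypothetical: the name-to-gender lookup, the list of allowed titles and the function names are assumptions made for illustration, not a description of any particular cleansing product.

```python
# Hypothetical lookup tables an automated cleansing step might rely on.
NAME_TO_GENDER = {"george": "male", "mary": "female"}
TITLE_FOR_GENDER = {"male": "Mr.", "female": "Ms."}
VALID_TITLES = {"Mr.", "Ms.", "Mrs.", "Dr."}


def auto_fill_title(record):
    """Return a copy of the record with gender and title inferred from the first name."""
    cleaned = dict(record)
    gender = NAME_TO_GENDER.get(record["first_name"].lower())
    if gender is not None:
        cleaned["gender"] = gender
        cleaned["title"] = TITLE_FOR_GENDER[gender]
    return cleaned


def is_valid(record):
    """Validity: the title conforms to the business rule (it is an allowed value)."""
    return record.get("title") in VALID_TITLES


def is_correct(record, real_world_title):
    """Accuracy: the title matches the real-world fact, which only the customer
    or another authoritative source can confirm."""
    return record.get("title") == real_world_title


customer = {"first_name": "George", "last_name": "Davis", "title": None}
cleaned = auto_fill_title(customer)

print(is_valid(cleaned))           # True: "Mr." passes the business rule
print(is_correct(cleaned, "Ms."))  # False: the customer is actually a woman
```

The record passes every automated check after cleansing, yet it is now wrong in a way that no rule in the staging area can detect.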
Problems Caused by this Process
There are several problems caused by the aforementioned approach:
- In the name of data quality, the process described actually introduces a new information quality problem. The data is now “inconsistent” or “nonequivalent” from the source data store to the downstream redundant data store.
- Not all types of errors can be corrected, including errors in the accuracy and comprehensiveness of data about events. Errors will remain, and new ones may even be introduced by the data cleansing or transformation software.
- It suboptimizes the cost of data correction since information customers still using source data at the source do not benefit from the corrections.
- Defective data remaining in the source can subsequently propagate and corrupt, or introduce errors into, the downstream data store.
- Correcting only the defective data perpetuates a permanent data sanitation program rather than fixing the broken processes that produce the defects.
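The nonequivalence described in the list above, where the downstream store no longer agrees with its source, can be made visible with a simple comparison. This is a minimal sketch under stated assumptions: the record layout, the `customer_id` key and the function name are all hypothetical.

```python
def nonequivalent_fields(source, downstream, key="customer_id"):
    """Map each shared key to the fields whose values differ between the two stores."""
    downstream_by_key = {row[key]: row for row in downstream}
    differences = {}
    for row in source:
        other = downstream_by_key.get(row[key])
        if other is None:
            continue  # record was never published downstream
        changed = {
            field: (row[field], other.get(field))
            for field in row
            if field != key and row[field] != other.get(field)
        }
        if changed:
            differences[row[key]] = changed
    return differences


# The staging area "fixed" a misspelled city, but only in the downstream copy.
source = [
    {"customer_id": "C1", "city": "Nashvile"},
    {"customer_id": "C2", "city": "Austin"},
]
downstream = [
    {"customer_id": "C1", "city": "Nashville"},
    {"customer_id": "C2", "city": "Austin"},
]

print(nonequivalent_fields(source, downstream))
# {'C1': {'city': ('Nashvile', 'Nashville')}}
```

Every entry in the result is a place where two processes reading "the same" customer will now see different facts.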
The Information Quality Process
To solve information quality problems when you have defective data:
- Treat all data cleansing efforts as one-time events. Correct data at its source if it is still used there. The only exception is when legal requirements prevent you from altering the data at its source; in that case, you must maintain the corrected data as an alternate for the processes that require correct data.
- Conduct a process improvement initiative on the processes at the source. This can be done using a Pareto approach to identify and attack the most problematic processes first. This involves a fundamental Plan-Do-Check-Act process improvement to analyze root causes and define improvements that eliminate the causes of errors (Improving Data Warehouse and Business Information Quality).
- As Deming’s third point of quality states, “Quality comes not from inspection, but from improvement of the production process. Inspection, scrap, downgrading, and rework are not corrective action on the process” (Out of the Crisis).
- Using the process, establish monitoring and real-time feedback to correct the data at the point of data creation. For example, the same business rules used for downstream audit checks should be invoked by the applications that create the data.
- Process improvements also look at the business process outside of the application for procedure clarity, checklists, form design, training, management accountability and other sources causing defects.
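Two of the steps above lend themselves to a short sketch: invoking the same business rules at the point of data creation and in the downstream audit, and using a Pareto ranking of audit results to decide which source process to improve first. The rules, field names and process names below are hypothetical and chosen only for illustration.

```python
from collections import Counter


# Hypothetical business rules, defined once and shared by every caller.
def rule_customer_id_present(record):
    return bool(record.get("customer_id"))


def rule_quantity_positive(record):
    quantity = record.get("quantity")
    return isinstance(quantity, int) and quantity > 0


RULES = [rule_customer_id_present, rule_quantity_positive]


def failed_rules(record):
    """Return the names of the shared rules this record violates."""
    return [rule.__name__ for rule in RULES if not rule(record)]


def create_order(record, store):
    """Point of creation: reject the record with immediate feedback."""
    failures = failed_rules(record)
    if failures:
        raise ValueError(f"rejected at entry: {failures}")
    store.append(record)


def audit_defects_by_process(records):
    """Downstream audit: count defective records per producing process."""
    return Counter(r["source_process"] for r in records if failed_rules(r))


def pareto_ranking(defect_counts):
    """Rank processes by defect count, with the cumulative percentage of all defects."""
    total = sum(defect_counts.values())
    ranking, running = [], 0
    for process, count in defect_counts.most_common():
        running += count
        ranking.append((process, count, round(100 * running / total)))
    return ranking


# Existing records created before the rules were enforced at entry.
legacy = [
    {"customer_id": "C1", "quantity": 2, "source_process": "web_form"},
    {"customer_id": "", "quantity": 1, "source_process": "call_center"},
    {"customer_id": "C3", "quantity": 0, "source_process": "call_center"},
    {"customer_id": None, "quantity": 5, "source_process": "call_center"},
    {"customer_id": "C5", "quantity": -1, "source_process": "batch_import"},
]

print(pareto_ranking(audit_defects_by_process(legacy)))
# [('call_center', 3, 75), ('batch_import', 1, 100)]
```

Because `create_order` and the audit call the same `failed_rules` function, a record that would fail the audit can no longer be created in the first place, and the ranking points the improvement effort at the process producing most of the defects.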
If someone wishes to only address data cleansing, they should call their method a “data cleansing,” “data inspection and correction” or “data corrective maintenance” process.
If you want a “data quality” or “information quality” process, however, you must understand and apply quality management principles including a strong process improvement method, based on the Shewhart Cycle of Plan-Do-Check/Study-Act or the Six Sigma variation of DMAIC (Define-Measure-Analyze-Improve-Control).
This is the main process that can lead you to become an Intelligent Learning Organization.
Please share your ideas for the intelligent learning organization at Larry.English@infoimpact.com.