This article originally appeared on the BeyeNETWORK.
The appeal of data quality is great. It is really hard to argue against data quality. Who in their right mind wants bad data in a data warehouse? Why, no one does!
So what’s the problem? Get in there and get that data clean.
While no one argues against data quality, getting it clean is another story.
Our story begins with data entering the system. Usually, data enters the system as a byproduct of doing a transaction – the completion of a sale, the withdrawal of $500 from the bank, the stocking of an item in the retail store, and so forth. Once the transaction is complete, all sorts of data crops up. The data is gathered and put into an application. Once in the application, the data eventually finds its way into a data warehouse.
Let’s say there is then a little miscalculation. Someone divides by x rather than multiplies by x. Someone forgets to put the time of transaction in the application database. Someone puts data into the application in ASCII rather than EBCDIC. Nobody’s perfect.
While the application may well thrive and prosper with a few errors, once the data enters the data warehouse, the errors are sure to be caught.
So let’s go back to the source applications and change the data.
Here’s where the rub comes in. The problem is that the applications were written fifteen years ago. There is no one around who even remembers the application. Everyone is scared that if we make one simple and innocent change in one place, something awful will appear somewhere else.
And the technology the application is written in no longer has practitioners in the workforce. An ad placed in the local newspaper yields only C++ and VB.NET programmers. There are no COBOL, IMS, or VSAM programmers to be had at any price.
Furthermore, the requirements that originally shaped the applications have changed significantly. If there are going to be changes made, you might as well update the requirements.
And last but not least, the budget for the applications has long been spent. The organization thought that when it finished the application, no more money would be required. Money is tight now and trying to get budget for repairing older applications is a very difficult thing to do.
Changing an old transaction-processing application to correct the way data is calculated or gathered is not a trivial act. We know the data is bad. But repairing the data at the point of collection is uniformly unpopular.
What’s an organization to do? Well, what most people do is make their corrections in ETL processing. Admittedly, not all errors can be fixed here, but a lot of them can. The data has to pass through the ETL process anyway, so efficiency of processing is not the issue. It is more a matter of being able to spot the error and to calculate the data correctly.
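To make that concrete, here is a minimal sketch in Python of what an ETL correction step might look like for the three kinds of errors mentioned earlier. Everything in it is an assumption made for illustration: the record layout, the field names, the factor `RATE` standing in for "x", and the choice of the `cp037` code page for EBCDIC. It is not taken from any particular ETL product.

```python
from dataclasses import dataclass
from typing import List, Optional

RATE = 4.0  # hypothetical stand-in for the "x" the source system divided by


@dataclass
class CleanRecord:
    customer: str
    amount: float
    txn_time: Optional[str]
    issues: List[str]  # problems the ETL step could only flag, not repair


def correct_record(raw: dict) -> CleanRecord:
    """Apply source-system corrections while the record passes through ETL."""
    issues: List[str] = []

    # Encoding fix: this (assumed) source wrote the name as EBCDIC bytes.
    customer = raw["customer"].decode("cp037").strip()

    # Calculation fix: the source divided by RATE where it should have
    # multiplied. Undo the division, then apply the intended multiplication.
    original_value = raw["amount"] * RATE
    amount = original_value * RATE

    # A missing transaction time cannot be reconstructed here, only reported.
    txn_time = raw.get("txn_time")
    if not txn_time:
        txn_time = None
        issues.append("missing txn_time")

    return CleanRecord(customer, amount, txn_time, issues)


# Usage: a record as the legacy application might have stored it.
raw = {"customer": "ACME".encode("cp037"), "amount": 25.0, "txn_time": None}
fixed = correct_record(raw)
```

Notice that the encoding and the calculation can be repaired in flight, while the missing transaction time can only be flagged. That is the sense in which a lot of errors, but not all of them, can be fixed in ETL.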
If there is any doubt as to the resistance to making changes to the source systems, look at what happened during Y2K. Massive numbers of companies turned to ERP software instead of going back to correct the old legacy code. And the irony is that the Y2K issue was not even a terribly difficult issue to deal with.
You can argue for data quality all you want, but actually implementing data quality at the source level is another story altogether.