This article originally appeared on the BeyeNETWORK.
Bad data in the data warehouse has been given an unfair and nefarious reputation. People are up in arms about bad data in a data warehouse when the data is not bad at all.
Let’s examine what bad data is perceived to be in a data warehouse. An end user goes to the data warehouse and sees some data. The end user then remembers a report from an operational system with a different set of numbers. The end user then proclaims that the data warehouse is not credible and should not be used for decision making.
This is what happens when a data warehouse is perceived to have bad data. But is the data really bad? There are actually some very good reasons why the data may appear bad when it is really not bad at all.
For instance, the data may appear to be bad if an end user looks at the data while it is in abeyance (waiting to be updated). There often are circumstances where two files must be merged as input into the ETL process. One file is ready at 9:00 a.m. and the other file is not ready until 6:30 p.m., yet both files must be merged. The file that is ready at 9:00 a.m. waits all day. Thus, it is true that the operational system has different numbers than the data warehouse for a period of time, but that doesn’t mean that the data warehouse has bad information.
Data may also appear to be bad if it has been integrated improperly. For instance, data in the operational environment is represented in inches, and data in the data warehouse is represented in centimeters. (Ask Martin Marietta about the Hubble scope and this error.) While it is true that data has appeared in the data warehouse that is incorrect, the problem is not with the data but with the transformation specification. Someone did not catch the transition from inches to centimeters.
But there is a really common case that arises that is not like these cases at all. This is especially true for people building a finance data warehouse. (There must be a course in college that all people in finance take that causes this phenomenon.)
One day, the financial analyst walks in and announces that the data warehouse has bad data. Accordingly, there is no credibility for the data found in the data warehouse. The financial analyst goes on to say – “I looked at one of my operational reports today, then I went and looked at the data warehouse. The data is not the same, so the data warehouse is not to be trusted.”
Does this mean that the data warehouse has incorrect or bad data? Not at all. The data in the data warehouse is corporate data, not application data.
In order to explain corporate data, consider this very normal scenario. There are three applications that feed data into the data warehouse. One application has a closing date of the end of the month. Another application has a closing date of the 28th of the month. And yet a third application has a corporate calendar closing date.
When data is put in the data warehouse, there can be only one corporate closing date. For the purposes of our example, we will choose the end of the month. So what happens when a transaction comes into the data warehouse on the 29th of the month? In the application system, the transaction closes in one month; in the corporate data warehouse, the transaction closes in the next month. Of course the results are different from the application data and the corporate data. Thus, the data in the data warehouse is not incorrect at all, it is simply corporate.
As it turns out, there are numerous reasons why data in an application can differ from data in a data warehouse. Another reason is a difference in chart of accounts for the application and the data warehouse. When data (transactions) is accounted for in the application, it is classified one way. In the data warehouse, it is classified another way. No wonder the reports coming from the data warehouse don’t look like the reports coming from the applications. Just because the reports are different does not mean they are wrong. In fact, the data warehouse has become the authoritative source of information for the corporation.
There is the case where the data really is incorrect in the data warehouse. In this case, a transaction has an amount of $10,823 when – somehow – the value should be $8,276. At last, we have some bad data – real bad data – in the data warehouse. It is noted that this form of bad data is quite different from all the other forms of bad data in the data warehouse.
Merely looking at data in the data warehouse does not say anything about whether the data really is incorrect.