This article originally appeared on the BeyeNETWORK.
Data quality must be an integral part of an enterprise. An enterprise runs under the assumption that the data contained inside it is accurate and valid. If the data is not valid, then there is no accountability for the decisions based upon it. It is necessary therefore, to assure that the data is of the highest quality.
The initial point of data entry is very important, because, if the data is wrong, the incorrectness of the data propagates throughout the system and is very difficult to find. For example, if the person enters “Bill K Inmon,” a great deal of work is required to determine that the entry should have been “Bill H Inmon.” The more automated the entry of data, the better.
The second place of importance for data quality is as the data enters the data warehouse. Data quality is important in the data warehouse because many forms of DSS processing rely on the data found in the data warehouse. If the data enters the data warehouse incorrectly, that error proliferates everywhere the data is used such as, in a data mart, in an exploration warehouse, etc.
The third place where data quality needs to be addressed proactively is in the data warehouse itself. Even after data is loaded into the data warehouse, the quality of the data needs to be occasionally verified and perhaps even modified.
Looking at the entity in its entirety, there are then three places where data quality is best addressed on a proactive basis:
- the point of capture,
- ETL integration process, and
- The data warehouse.
Point of Capture
There are many ways to address data quality as data is first entered. Some of the ways are:
- to enter as much data as possible in an automated manner, through credit cards, airline tickets, pre-established user profiles, etc. This absolutely minimizes the need for human interaction,
- if data must be entered manually, it should be entered from pre-set options, if at all possible,
- if data must be entered manually, and pre-set options cannot be used to any great extent, use trained operators,
- to make the interaction with the operator or end-user as straightforward as possible; to be consistent from one screen to the next, use the same conventions, the same codes, and the same sequence of interactions, etc.
These are just a few of the ways that data will be entered into the system in as efficient and accurate a manner as possible. Of course, for non-transaction based systems, data must be entered in the way in which the data can be captured.
In addition to taking data quality seriously at the point of entry, the data needs to be edited immediately. When questionable data is entered into the system, there should be immediate feedback to the operator questioning the validity of the data. There can be feedback about such things as:
- “you have entered “T” for gender. “M” or “F” is more appropriate”
- “an income of $1,000,000 is questionable. Are you sure?”
- “you have entered a first name of “Sue” and a gender of “M.” Is this correct?”
- “you have entered a last name of “Venkatakrishnanamen.” This is a long name. Is this correct?”
There is no need to reject what has been entered. Instead, a confirmation question is in order to make sure that the entry is proper.
ETL Integration Process
There are several reasons why data quality is important here. In many cases, the older collection applications are beyond the point of amendment. Organizations are simply incapable of repairing older applications. When this occurs, the moment that data passes from the collection applications would be the first time that the data can be manipulated. Since the data is being pulled from the collection applications during the ETL/integration process, it is convenient to check the data quality at this point.
Another important reason why data quality assurance is best done during the ETL/integration process is that it is the first time when enterprise-wide integration occurs.
The data collection process concerns the initial entry of data. The ETL/integration process concerns the issues of integration. These are an entirely different set of issues. Therefore, if data must be examined in any case, it simply is easiest to do the bulk of the examination at the point of ETL processing.
The types of data quality activities that occur here are:
- reformatting data into a common format;
- summarizing data;
- restructuring keys into a common format;
- restructuring records of data into a common format;
- adjusting reference data, etc.
ETL processing can be done in an automated or manual manner. If there are a large number of programs that must be built and maintained, then it makes sense to do the ETL process with automation. Only if there are very few collection applications that have to be managed, does it make sense to build and maintain the ETL interface manually.
There is one drawback to data quality at the point of ETL/integration and that is, it is only possible to look at the data inside the record being analyzed.
The Data Warehouse
The third place where data quality needs to be addressed proactively is in the data warehouse itself.
Even after data is loaded into the data warehouse the quality of the data needs to be occasionally verified and perhaps even modified. The data warehouse contains information that can change over time. Zip codes change, area codes change, addresses change, names change, etc. If an entry is made in the data warehouse in January, 1995, using the accounting code “ACD”, the data is just fine as entered. But as time passes the corporation adopts a new chart of accounts. In March, 2002, the accounting code “ACD” has been changed to “KHY”. There is no way for an analytical program to know that a code one day is a different code the next. Analytical programs are simply not that sophisticated. Therefore, to avoid confusion, a way to validate the data found in the data warehouse must be identified and implemented.
The data that is loaded into the data warehouse must be monitored after it is loaded. Examples of what the data warehouse data quality monitors look for are:
- “is this zip code out of date?”
- “is a value of $100,000 appropriate here?”
- “is there really someone named SMYUTHE?”
- “is the format for date – yyyymmdd – consistent?” etc.
Categories of Examination
There are several different categories of examination for data quality assurance that have their own distinctive characteristics. These are:
- single record;
- multiple records in the same file; and,
- multiple records in different files.
The first category is single record examination, where each record is handled one at a time. Such an instance occurs during typical ETL processing. ETL processing must be very conscious of resources used, because typically, there is a limited window of time that ETL processing can occur. Therefore, ETL processing must be aware of the efficiency of operation. The best way to ensure efficient operations is to pass records sequentially, and handle only one record at a time. In other words, as ETL processing is occurring, the ETL process cannot take the luxury of finding other records that relate to the record being examined. In such a fashion ETL processing is streamlined. But the complexity of the examination is limited. Some examples that can be processed one record at a time are:
- verifying the domain of a data element;
- verifying the range of a data element;
- verifying the relationship between different data elements within the same record;
- verifying the formatting of data elements.
These then are some of the simple audits and verifications that can be done while looking at one record at a time.
The second category of examination is one in which multiple records of the same file are audited and examined. The implication is that there is a single physical order that the records are in as they are passed. When the records are passed in the same physical sequence in which they are stored, processing is efficient. One I/O can be used to gather and analyze as many records as the block will hold. But when the records have no physical order, or when the records are passed sequentially, in an order other than the one in which they are physically stored, then an extraordinary amount of I/O may result.
Nevertheless, when the records have a physical sequence, it is possible to pass them in that sequence and still perform some data quality assurance activities. Data quality assurance activities include verifying that summarization is done properly. An example of this kind of processing might be reading the activities that have occurred during the week, there are two transactions for Monday. There are three transactions for Tuesday. There are ten transactions for Wednesday, and so forth. All counters are set to zero for the first of the week. As the weekly records are read, the counters are tallied with information gained from reading the record. Finally at the end of the week, a weekly record is written and the daily totals are compared with those totals found in the weekly record. If the weekly totals are not correct, the proper values are entered and a message that an error has occurred is written.
Another type of data quality assurance auditing involves multiple record types being edited and audited together. There are several limitations to this kind of audit. The primary limitation is that not many records can be audited at once, unless the files are small and machine resources are abundant.
The advantage of a multi-file audit is that practically anything that needs to be checked can be checked. There are very few other limitations to this style of audit, other than the one previously discussed.
Other data quality examinations that involve files of different types are:
- Cross-file data quality verification and auditing;
- Verifying the relationship between data residing in two files;
- Verifying to see that the data in one file equals the data in another file scattered over multiple records;
- Verifying cross-file referential integrity.
These are the ways that quality assurance auditing and examination programs can occur.
An Automated Approach is Preferable
There are many reasons why automation is preferred over a manual approach. The types of files that are being audited and examined are simply too large for a manual approach. In addition, consistency and accuracy are the major reasons why a manual approach to auditing and examination is not recommended.
So what happens when bad or questionable data is encountered? There are several actions that can be taken depending on:
- the data;
- the severity of the error;
- whether the data can be corrected in any case;
- the volume of errors;
- the frequency with which they occur, etc.
One alternative upon finding an error is simply to correct the error. This solution implies that the correct answer is known and that the data can be corrected on the spot. However, in many cases an error can be spotted, but the correction is not readily apparent. The next alternative is to simply to insert a default value upon finding an error. The default value may not be the proper correction, but is better than allowing incorrect data to populate the system. Another alternative to correcting data or taking action upon finding incorrect data during an audit is to simply delete the record that has faulty data. While this approach may indeed purge the system of incorrect data, it also purges the system of valid data. A third alternative is to simply flag the faulty record, rather than delete it. In this way the user of the data is alerted that there is a problem. And at the same time, the designer has not thrown away perfectly valid and otherwise useful data. And finally, when an error has occurred and is detected a separate report can be created. Inaccurate data is left in the file, but, it has been detected and an appropriate alert has been generated.
Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations. Bill can be reached at 303-681-6772.
Editor's Note: More articles, resources and events are available in Bill's BeyeNETWORK Expert Channel. Be sure to visit today!