Data Cleansing

The hard reality is that data anomalies do exist, and they are always problematic and often costly. Will Dwinnell explains why it is helpful to have a tool to automatically ferret out a substantial fraction of those anomalies.

This article originally appeared on the BeyeNETWORK.

An IT department at a major industrial firm maintains a small reporting environment for a service group. The reporting system accesses a very small portion of the operational database: only 4 tables. Database joins among those tables, a fundamental component of this reporting environment, were created incorrectly, meaning that data requested from this system would be incorrect. This mistake was not discovered for 3 years.

The human resources database at a large utility has records for employees who are listed as being extraordinarily young or old. Many employees are listed as being under 4 years of age and one is listed as being several hundred years old.

A customer database at a well-known bank used a single field for the client's name. Unfortunately, no data entry rules were enforced, so "Sarah Connor" might also appear as "Connor, Sarah" and "S. Connor", leading to duplicate records and search problems. This problem has never been fixed.

Every one of these stories is true. The sad fact is that databases in real organizations – large, profit-oriented organizations – are full of holes. Data is missing, duplicated, dirty or just plain wrong. There are many reasons for this, from the benign (data entry errors and misspelling) to the malicious (embezzlement and fraud). These data irregularities are expensive. They make data retrieval slower and less accurate (and thus affect business decisions downstream). They make database programming and maintenance more difficult and costly. They hide accidental, inappropriate and even criminal activity. From the perspective of the business analyst, statistician and data miner, these errors represent noise which masks important (and potentially very valuable) patterns of behavior.

Whatever the cause of data quality issues, it is of great interest to detect and correct them. This task is called data cleansing or data auditing. Automated tools which assist with data cleansing are available from commercial suppliers. These tools utilize a variety of approaches to attack data quality problems. Chief among these is anomaly detection (also known as deviation detection), which is the discovery of "unusual" cases (errors or other cases of interest). Anomaly detection has been performed using a number of different technologies, including neural networks, statistics and artificial immune systems.

Anomaly detection (just like data compression or data mining) relies on the fact that most real data contains recognizable patterns. Patterns detected by computers in data can be summarized or modeled. Cases which deviate from discovered patterns are potential errors. While anomaly detection can be effective at discovering errors, notice the following important aspects of this approach:

First, patterns must be recognizable. If the pattern discovery scheme cannot find the kind of patterns which exist in the data set, it will not be able to identify possible errors. Different modeling systems, such as neural networks and rule-induction systems, represent patterns differently and will function better when searching for patterns in different data sets.
Second, discovered patterns must be significant (see sidebar: Significance vs. Strength of Associations). If the data set is too small or too noisy to find significant patterns, anomaly detection will not be possible. A seemingly "strong" pattern might occur in only a few cases, and so may not be statistically stable enough to use. For example, it may be discovered that 9 out of 10 customers that buy cameras also buy film. The question of significance is: Is this literally 9 out of 10 customers, or is it 90% of thousands of samples? The number of examples affects our confidence in the discovered pattern, and thus the importance of cases which deviate from that pattern.
Third, discovered patterns must be (relatively) strong. Patterns which are statistically significant may still be too weak to allow one to discern errors. Deviations from weak patterns are, by definition, to be expected. During a recent data mining project for a client in the financial industry, I discovered that there was a statistically significant difference in the ages of two groups of customers. Unfortunately, this pattern was weak: while the difference was significant, it was also very small; furthermore, there was a substantial overlap in age between the two groups, making age a poor discriminator of group membership. Deviations from so weak a pattern would just as likely be due to normal variation as errors.
Fourth, errors which do not violate discovered patterns will not be detected. It is the nature of real-world anomalies that some of them will change the original data to something else which "looks" legitimate – in the case of data tampering, this may even be deliberate.
Last, not all deviations from strong, significant patterns will be true anomalies. Anomaly detection which depends on patterns in the data is fundamentally a heuristic process, and cannot always be right. Real data contains all sorts of complexity, and records flagged as possible anomalies should be treated with suspicion, not blindly discarded.
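The overall approach described above can be sketched in a few lines of code. The snippet below is a minimal, hypothetical illustration (not the algorithm of any particular product): the "pattern" is simply the distribution of values observed for each key, and a record is flagged only when it deviates from a pattern that is both well-supported (enough examples) and strong (a dominant value exists):

```python
from collections import Counter, defaultdict

def find_anomalies(records, key_field, value_field, min_support=5, max_freq=0.1):
    """Flag records whose value_field is rare given their key_field.

    A toy illustration of pattern-based anomaly detection: the "pattern"
    is the distribution of values seen for each key, and deviations from
    a sufficiently well-supported pattern are flagged as suspects.
    """
    by_key = defaultdict(Counter)
    for rec in records:
        by_key[rec[key_field]][rec[value_field]] += 1

    flagged = []
    for i, rec in enumerate(records):
        counts = by_key[rec[key_field]]
        total = sum(counts.values())
        if total < min_support:  # too few examples: pattern not significant
            continue
        if counts[rec[value_field]] / total <= max_freq:
            flagged.append(i)    # rare value: deviation from the pattern
    return flagged
```

Note how the `min_support` threshold embodies the significance requirement (point two above) and the `max_freq` threshold embodies the strength requirement (point three): records are flagged only when both are met, and even then they are only suspects, not confirmed errors.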


One automatic data cleansing tool is WizRule (current version: 3.3) from WizSoft. WizRule analyzes the data to find patterns (rules, which it reports) and deviations from those patterns. The software provides a three-pronged approach to anomaly detection: logical (if…then) rules, numeric rules and spelling rules.
Logical rules model data patterns using if…then statements, such as:

IF region IS "East"
AND car_type IS "sports car"
THEN salesman IS "Max Rockatansky"

Another example, using WizRule's "starts with" feature is:

IF product IS "wrench"
THEN product_code STARTS_WITH "HW57"

In this rule, wrenches are associated with any product code starting with "HW57", like "HW57701", "HW57999", etc.
Logical rules in WizRule can describe patterns in numeric variables as well as categorical ones:

IF substance IS "water"
AND state IS "liquid"
THEN fahrenheit_temp IS 32 to 212

WizRule also has a special facility for dealing with dates:

IF month IS "January"
AND country IS "Canada"
THEN ground_cover IS "white"
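Checking records against a logical rule of this kind is straightforward to sketch. The helper below is a hypothetical illustration, not WizRule's engine: given a rule's IF conditions and THEN conclusion, it flags the records that satisfy every condition but violate the conclusion:

```python
def rule_deviations(records, conditions, conclusion):
    """Return indices of records matching all IF conditions but violating THEN.

    conditions: dict mapping field name -> required value (the IF part)
    conclusion: (field, expected_value) pair (the THEN part)
    Hypothetical helper for illustration; not WizRule's actual engine.
    """
    field, expected = conclusion
    return [
        i for i, rec in enumerate(records)
        if all(rec.get(f) == v for f, v in conditions.items())  # IF part holds
        and rec.get(field) != expected                          # THEN part fails
    ]
```

For the first rule above, `rule_deviations(sales, {"region": "East", "car_type": "sports car"}, ("salesman", "Max Rockatansky"))` would return the records that break the pattern.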

Numeric rules are formulas of moderate complexity which describe patterns in numeric data. Numeric rules in WizRule may take several functional forms, one being this linear form:

ice_cream_sales = 4.8 * temp + 10.3

There are several nonlinear forms as well, including hyperbolic, percentage, and an interesting one called "multiparameter" which can relate patterns in up to 4 variables.
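As a rough illustration of how a linear numeric rule can drive anomaly detection (a sketch of the general idea, not WizRule's method), the code below fits a least-squares line to two numeric fields and flags records whose residual exceeds a tolerance:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ a*xs + b, in pure Python."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

def numeric_rule_deviations(xs, ys, tolerance):
    """Flag indices where y deviates from the fitted line by more than tolerance."""
    a, b = fit_line(xs, ys)
    return [i for i, (x, y) in enumerate(zip(xs, ys))
            if abs(y - (a * x + b)) > tolerance]
```

For the ice cream example, a record whose sales figure sits far off the `4.8 * temp + 10.3` line would be flagged as a possible data entry error.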
Spelling rules delineate patterns in spelling. WizRule looks for values in text fields which are similar, where one is very frequent and the other is very infrequent (as though a name were occasionally mistyped). Deviations from spelling rules look like this:

The value "Mississippi" appeared 752 times in the State field. There are 3 cases containing similar values: records 31, 33 and 551.

Note that WizRule's spelling rules do not come from a spell checker; that is: there is no dictionary with correct spellings. WizRule simply relies on the data to provide "typical" spellings.
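A simple way to approximate this behavior is to compare rare values against frequent ones using edit distance. The sketch below is a hypothetical illustration of the idea, not WizRule's implementation: a value is suspect if it is infrequent and within a small edit distance of a very frequent value:

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two strings, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def spelling_suspects(values, min_common=10, max_rare=2, max_dist=2):
    """Flag rare values that nearly match a frequent value in the same field."""
    counts = Counter(values)
    common = [v for v, c in counts.items() if c >= min_common]
    rare = {v for v, c in counts.items() if c <= max_rare}
    return [
        (i, v) for i, v in enumerate(values)
        if v in rare and any(0 < edit_distance(v, c) <= max_dist for c in common)
    ]
```

On a State field containing hundreds of copies of "Mississippi" and a handful of "Missisippi" entries, this would flag the latter as probable typos, just as in the report above.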
It is important to understand that WizRule is looking for rules which are diagnostic, not predictive. Although some of the discovered rules could be used for predictions, they cover all variables (as opposed to just a single target variable) and are intended to explain important patterns in the database, not provide optimal accuracy on every individual case. There are more variations on the rules than I have described here, but the WizRule documentation contains a thorough appendix which explains the rule forms and all of the calculations and measures performed by the software. WizRule provides many options for controlling or restricting the search for patterns so that its efforts may be focused on fields of interest.


Figure 1: This is a portion of a WizRule Rule Report, showing a few rules.
Note that the numbers after the rules are the record ID numbers of anomalies, not counts of records.


The most obvious application of this technology is in the detection and elimination of errors in the data: missing or incorrect information. However, any situation in which "unusual" data is of interest is a potential application of anomaly detection, such as auditing and fraud detection. Any of the three situations mentioned at the beginning of this article would have been detected quickly by this technology.
The showcase use of WizRule is at the MTA (NY Metropolitan Transportation Authority). The Office of the Inspector General at the MTA used WizRule to analyze organizational data. WizRule discovered a "salami slicing" scheme by an MTA manager. Salami slicing is an old computer crime in which fractions of pennies lost to rounding are collected in the perpetrator's account.
WizRule has also been applied to fraud detection problems in other fields, such as the insurance industry, as well as simple data clean-up operations. One interesting case has to do with due diligence. One company was buying another and it was discovered that an important database was poorly documented. WizRule was able to reverse engineer the business rules in the database and prevent a potential maintenance disaster.


Data contains anomalies; that's the hard reality. Some of those anomalies are errors and some are fraudulent manipulations – and most are very expensive if not fixed. It helps to have a tool like WizRule to automatically ferret out a substantial fraction of those anomalies.

Figure 2: Significance vs. Strength of Associations.

Discovered associations between variables can be either strong or weak and either significant or insignificant. Whether relationships are strong and whether they are significant are separate issues. These 4 graphs show linear relationships between two variables. Note that all have identical dispersion from the lines.

Top-left: The ideal.
Top-right: Significant, but weak, so likely not useful ("business significant").
Bottom-left: With so few points, this relationship may or may not be real.
Bottom-right: We haven't found anything.
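One way to see that strength and significance are separate issues is the standard t-statistic for a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)): the strength r can stay fixed while significance grows with the sample size n. A small sketch:

```python
import math

def t_statistic(r, n):
    """t-statistic for testing a Pearson correlation r measured on n points.

    Larger |t| means stronger evidence the association is real (significant);
    r itself measures how strong the association is.
    """
    return r * math.sqrt((n - 2) / (1 - r * r))

# Same strength, very different significance:
# t_statistic(0.2, 10)   is about 0.58 -- weak and not significant
# t_statistic(0.2, 1000) is about 6.4  -- still weak, but highly significant
```

This is exactly the top-right panel's situation: a weak association that, given enough data, is nonetheless unmistakably real.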
