This article originally appeared on the BeyeNETWORK.
There are a number of data quality dimensions used to define metrics for quantifying levels of data quality, and many of these measurements are easily automated. However, I am often faced with the challenge of measuring the dimension of data accuracy – the determination of whether the managed data values accurately reflect the real-world objects they are intended to model. The issue is that it is often impossible to automate the determination of accuracy. We can create an automated process to compare the values in one data set against the values in a corresponding “system of record.” This kind of comparison can measure consistency across the two data sets, but it does not ensure absolute correctness, because there is no guarantee that the system of record is accurate either.
The alternative to automation is a manual process – perhaps the best way to check accuracy is to have a real person follow up by evaluating the data associated with each real-world object and verifying the object’s attributes against the ones in the model. If the data set is small, the accuracy challenge is burdensome, but not impossible. If the data set is large, however, the prospect of manually verifying data accuracy is not only distasteful, it is likely to be so monotonous that the repetitive activity is bound to introduce verification errors along the way, which of course would undermine the entire result.
One approach to remedying this apparently Sisyphean task is to adopt a strategy long used by statisticians: sampling. The objective of sampling is to draw conclusions about a large body of information by selecting a smaller number of instances (or “observations” in statistics lingo) for the purposes of inference. Of course, the sampled data instances should be representative of the entire set, since we anticipate using the results to draw conclusions about the entire data set. So how do we isolate a representative sample from which we can effectively draw reasonable inferences regarding the accuracy of the data set as a whole?
The question really is: how do we perform our sampling so that the extracted sample is small enough to make the manual process feasible, yet provides enough confidence in the results? The first idea is to employ random sampling – randomly selecting data instances from the set to ensure that there are no dependencies associated with the way the data items were selected. Next, consider that the key phrase here is “confidence,” and that is reflected in the terminology used in calculating sample size. To determine the size of the sample, the analyst must determine the confidence level and confidence interval associated with the expected results.
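As a sketch of that first step, a simple random sample can be drawn with nothing more than the standard library; the record identifiers here are hypothetical stand-ins for whatever keys the actual data set uses:

```python
import random

def draw_sample(records, sample_size, seed=None):
    """Draw a simple random sample of records for manual accuracy review.

    A fixed seed makes the draw reproducible, which helps if the
    selection ever needs to be audited.
    """
    rng = random.Random(seed)
    return rng.sample(records, sample_size)

# Hypothetical data set: one million record identifiers
records = list(range(1_000_000))
review_batch = draw_sample(records, 600, seed=42)
```

Note that random.sample selects without replacement, so no record lands in the review batch twice; in practice the sample would be drawn from record keys in the database rather than an in-memory list.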
The confidence level characterizes “how sure one is” about the results – it is the probability that the result lies within the margin of error; typical confidence levels are 95% and 99%. The confidence interval is the “margin of error” associated with the results. For example, if the confidence level is 95% and the confidence interval is 2, then a measurement of 86 means that you are 95% sure that the true measure is between 84 and 88. The third variable potentially affecting sample size is the population size, but only if the population is small. To calculate the sample size, we can use this expression:
Sample size = (Z² * p * (1 - p)) / C²
In this expression, Z is the z-value corresponding to the confidence level – the number of standard deviations that captures that share of a normal distribution (1.96 for 95%, 2.58 for 99%); p is the estimated proportion of the population picking a choice (for our purposes, .5 or 50%, representing the choice between “accurate” and “inaccurate,” which is the worst case and therefore the safest assumption); and C is the confidence interval expressed as a decimal. So for a 95% confidence level and a +/-4 confidence interval, the sample size would be approximately 600. This is the same as saying that with a truly random extract of 600 records for accuracy validation, a result of 90% accuracy means that with 95% probability, the percentage of accurate records is between 86% and 94%. For a 99% confidence level (z-value of 2.58) and a +/-2 margin of error, the sample size would be 4,160, whether you had 100,000 or 1,000,000 records.
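The calculation is easy to script. This sketch reproduces the two examples above, and also adds the standard finite population correction (not covered in the article) for the case mentioned earlier where the population is small enough to matter:

```python
def sample_size(z, margin_pct, p=0.5):
    """n = (Z^2 * p * (1 - p)) / C^2, with C the margin of error as a decimal.

    p = 0.5 is the worst case (accurate vs. inaccurate), so it never
    understates the required sample.
    """
    c = margin_pct / 100.0
    return round((z ** 2 * p * (1 - p)) / c ** 2)

def sample_size_fpc(z, margin_pct, population, p=0.5):
    """Apply the standard finite population correction for small populations."""
    c = margin_pct / 100.0
    n = (z ** 2 * p * (1 - p)) / c ** 2
    return round(n / (1 + (n - 1) / population))

print(sample_size(1.96, 4))   # 95% confidence, +/-4 margin -> 600
print(sample_size(2.58, 2))   # 99% confidence, +/-2 margin -> 4160
```

The correction only becomes noticeable when the sample is a sizable fraction of the population, which is why the uncorrected formula serves for the large data sets discussed here.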
Given a truly random selection process, then, for any large data set we could select 4,160 records for evaluation of accuracy and have a 99% level of confidence that our answer would be within a +/-2 margin of error. While 4,160 is not a small number, it is certainly more reasonable than looking at all of the values, and it provides enough precision to determine the magnitude of any accuracy issues. Five staff members could verify that number of records in a week by each making 21 telephone calls an hour (5 people × 40 hours × 21 calls is roughly 4,200 verifications).
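Going the other direction, once the review is complete, the achieved margin of error around the observed accuracy rate can be computed. Because the observed proportion is usually away from the worst case of 50%, the actual interval comes out tighter than the one the sample was sized for; a sketch, using the same z-values as above:

```python
import math

def margin_of_error(p_hat, n, z=2.58):
    """Half-width of the confidence interval around an observed proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# 90% observed accuracy in the 4,160-record, 99%-confidence sample
m = margin_of_error(0.90, 4160)
print(f"accuracy 90% +/- {100 * m:.1f} points")  # -> accuracy 90% +/- 1.2 points
```

Here the achieved margin is about +/-1.2 percentage points rather than the +/-2 the sample was designed for, since designing against p = 0.5 is deliberately conservative.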
Is this the optimal approach to measuring accuracy? In the absence of a true source of record, the answer is probably “yes.” But understanding something about the arithmetic of sample sizes does provide an additional level of confidence that the assessment measurement is actually meaningful for the entire data set.