
# Accuracy, Sampling and the Accuracy of Sampling

## Manually verifying the accuracy of a large data set is bound to introduce errors. Sampling offers a way to draw conclusions about a large body of information by selecting a smaller number of instances for inference.

There are a number of data quality dimensions used to define metrics for quantifying levels of data quality, and many of these measurements are easily automated. However, I am often faced with the challenge of measuring the dimension of data accuracy: the determination of whether the managed information accurately reflects the real-world items it is intended to model. The issue is that it is often impossible to automate the determination of accuracy. We can create an automated process to compare the values in one data set against the values in a corresponding “system of record.” This kind of comparison can measure consistency across the two data sets, but it does not ensure absolute correctness, because there is no guarantee that the system of record is accurate either.
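To illustrate what such an automated consistency check might look like, here is a minimal sketch in Python. It assumes both sources can be loaded as dictionaries keyed by a record ID; the record IDs and field names are hypothetical, not taken from any particular system.

```python
# Minimal consistency check between a working data set and a "system of record".
# Both are assumed to be dicts keyed by record ID; IDs and field names are made up.

working = {
    "C001": {"name": "Ann Smith", "phone": "301-555-0101"},
    "C002": {"name": "Bob Jones", "phone": "301-555-0102"},
}
system_of_record = {
    "C001": {"name": "Ann Smith", "phone": "301-555-0101"},
    "C002": {"name": "Bob Jones", "phone": "301-555-0199"},
}

def consistency_rate(data_set, sor, fields):
    """Fraction of shared records whose listed fields match the system of record."""
    shared = data_set.keys() & sor.keys()
    matches = sum(
        all(data_set[k].get(f) == sor[k].get(f) for f in fields) for k in shared
    )
    return matches / len(shared) if shared else 0.0

print(consistency_rate(working, system_of_record, ["name", "phone"]))  # 0.5
```

A result like 0.5 only tells us the two sources agree on half the records; as the paragraph above notes, it says nothing about which source, if either, reflects reality.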

The alternative to automation is a manual process: perhaps the best way to check accuracy is to have a real person follow up by evaluating the data associated with the real-world object and verifying the object’s attributes against the ones in the model. If the data set is small, the accuracy challenge is burdensome, but not impossible. However, if the data set is large, the prospect of manually verifying data accuracy is not only distasteful, it is so monotonous that the repetitive activity is bound to introduce errors in verification along the way, which of course undermines the entire exercise.

One approach to remedying this apparently Sisyphean task is to adopt a strategy long used by statisticians: sampling. The objective of sampling is to derive conclusions about a large body of information by selecting a smaller number of instances (or “observations” in statistics lingo) for the purposes of inference. Of course, the sampled data instances should be representative of the entire set, since we anticipate using the results to draw conclusions about the entire data set. So how do we isolate a representative sample from which we can draw reasonable inferences about the accuracy of the data set as a whole?

The question really is: How do we perform our sampling in a way that the extracted sample is small enough to warrant the manual process yet provides enough confidence in the results? The first idea is to employ random sampling: randomly selecting data instances from the set ensures that there are no dependencies associated with the way the data items were selected. Next, consider that the key phrase here is “confidence,” and that is reflected in the terminology used in calculating sample size. To determine the size of the sample, the analyst must determine the confidence level and confidence interval associated with the expected results.
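For illustration only, here is one way the random selection step might be sketched in Python; the record IDs are fabricated, and the sample size shown is a placeholder until it is computed from the confidence level and confidence interval discussed below.

```python
import random

# Fabricated universe of record IDs standing in for the full data set.
all_record_ids = [f"REC{i:07d}" for i in range(1_000_000)]

# Placeholder sample size; the proper value comes from the confidence-level
# calculation described below.
sample_size = 600

# random.sample draws without replacement, so every record has the same chance
# of selection and no record is reviewed twice.
records_to_verify = random.sample(all_record_ids, k=sample_size)

print(len(records_to_verify), records_to_verify[:3])
```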

The confidence level characterizes “how sure one is” about the results; it is the probability that the true value lies within the margin of error. Typical confidence levels are 95% and 99%. The confidence interval is the “margin of error” associated with the results. For example, if the confidence level is 95% and the confidence interval is 2, then a measurement of 86 means that you are 95% sure that the true value is between 84 and 88. The third variable potentially affecting sample size is the population size, but only if the population is small. To calculate the sample size, we can use this expression:

Sample size = (Z² × p × (1 − p)) / C²

In this expression, Z is the z-value, the multiplier corresponding to the number of standard deviations that captures the desired percentage of the population for the chosen confidence level (1.96 for 95%, 2.58 for 99%); p is the estimated percentage of the population picking a choice (for our purposes, .5 or 50%, representing the choice between “accurate” and “inaccurate”); and C is the confidence interval expressed as a decimal. So for a 95% confidence level and a +/-4 confidence interval, the sample size would be approximately 600. This is the same thing as saying that with a truly random extract of 600 records for accuracy validation, a result of 90% accuracy means that with 95% probability, the percentage of accurate records is between 86% and 94%. For a 99% confidence level (z-value of 2.58) and a +/-2 margin of error, the sample size would be 4,160, whether you had 100,000 or 1,000,000 records.
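To make the arithmetic concrete, here is a small sketch of that calculation in Python; the function name and the rounding choice are mine, not part of the original formula.

```python
# Standard-normal z-values for the two confidence levels discussed above.
Z_VALUES = {0.95: 1.96, 0.99: 2.58}

def sample_size(confidence_level, margin_of_error, p=0.5):
    """Sample size for estimating a proportion; p = 0.5 is the most conservative choice."""
    z = Z_VALUES[confidence_level]
    return round((z ** 2) * p * (1 - p) / margin_of_error ** 2)

print(sample_size(0.95, 0.04))  # 600  -> 95% confidence, +/-4 points
print(sample_size(0.99, 0.02))  # 4160 -> 99% confidence, +/-2 points
```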

Given a truly random selection process, then, for any large data set, we could select 4,160 records for evaluation of accuracy and have a 99% level of confidence that our answer would be within a +/-2 margin of error. While 4,160 is not a small number, it is certainly more reasonable than looking at all of the values, and it provides enough precision to determine the magnitude of any accuracy issues. Five staff members could verify that number of records in a week by each making about 21 telephone calls an hour (4,160 records ÷ 5 staff ÷ 40 hours ≈ 21).

Is this the optimal approach to measuring accuracy? In the absence of a true source of record, the answer is probably “yes.” But understanding something about the arithmetic of sample sizes does provide an additional level of confidence that the assessment measurement is actually meaningful for the entire data set.

David Loshin
David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions, including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of The Practitioner's Guide to Data Quality Improvement, Master Data Management, Enterprise Knowledge Management: The Data Quality Approach and Business Intelligence: The Savvy Manager's Guide. He is a frequent speaker on maximizing the value of information. David can be reached at loshin@knowledge-integrity.com or at (301) 754-6350.
