The mere perception of poor data quality can be enough to trigger actions intended to drive quality improvements, with those actions frequently taking the form of purchasing data quality tools. However, while buying and implementing data quality software may be seen within an organization as concrete evidence that progress is being made in fixing data quality problems, true progress is often delayed until the tools are properly put to use as part of a comprehensive set of data quality processes.
The purchase and use of data quality tools is justifiable in most cases, but the key to success isn't so much about acquiring a tool set as it is about getting the right tools at the right time to supplement best practices in data quality management. Business needs should always drive tool acquisition, and dovetailing good quality management procedures with the right kinds of tools for your organization will optimize your investments in data quality technology.
Fortunately, a properly done data quality assessment yields tangible results that can help make the case for incorporating data quality tools into a quality improvement plan. The prioritized lists of data quality issues and potential fixes that the assessment process produces essentially define the business needs for data quality improvement and the steps that can be taken to correct the problems.
Remediation tasks that are approved and funded as part of a data quality program may include validation of data quality rules for completeness, consistency and reasonableness; data imputation (which attempts to “fill in” missing data); data standardization, correction and enhancement; and management and tracking of data issues. Some of those functions can be supported by data quality software, but they all rely on well-defined data quality processes and procedures being in place. The technology itself is not a magic bullet for fixing data quality issues – a fact that data quality teams should make clear to corporate and business executives when seeking approval for product purchases.
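To make the remediation tasks above more concrete, here is a minimal Python sketch of rule validation and imputation. Everything in it is an illustrative assumption rather than the behavior of any particular product: the sample records, the field names, the three rule functions, the 0-to-120 age range and the choice of median imputation are all hypothetical.

```python
from statistics import median

# Hypothetical customer records; None marks a missing value.
RECORDS = [
    {"id": 1, "age": 34, "country": "US", "state": "NY"},
    {"id": 2, "age": None, "country": "US", "state": "CA"},
    {"id": 3, "age": 212, "country": "CA", "state": "NY"},
    {"id": 4, "age": 51, "country": "US", "state": None},
]

US_STATES = {"NY", "CA"}  # deliberately tiny list, for illustration only


def is_complete(rec):
    """Completeness: every field is populated."""
    return all(value is not None for value in rec.values())


def is_consistent(rec):
    """Consistency (assumed rule): a US state code implies country 'US'."""
    return rec["state"] not in US_STATES or rec["country"] == "US"


def is_reasonable(rec):
    """Reasonableness (assumed rule): a known age falls between 0 and 120."""
    return rec["age"] is None or 0 <= rec["age"] <= 120


RULES = {
    "completeness": is_complete,
    "consistency": is_consistent,
    "reasonableness": is_reasonable,
}


def validate(rec):
    """Return the names of the rules that the record fails."""
    return [name for name, rule in RULES.items() if not rule(rec)]


def impute_age(records):
    """Fill missing ages with the median of the ages that pass the reasonableness rule."""
    known = [r["age"] for r in records if r["age"] is not None and is_reasonable(r)]
    fill = median(known)
    return [dict(r, age=fill) if r["age"] is None else dict(r) for r in records]


# Map each record id to the rules it fails, then build an imputed copy.
issues = {rec["id"]: validate(rec) for rec in RECORDS}
imputed = impute_age(RECORDS)
```

In this sketch, `issues` can feed an issue-tracking list, while `imputed` is a copy that leaves the source records untouched. Even in this toy form, the rules and thresholds have to come from agreed business definitions; the code only enforces them.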
That doesn’t mean the available tools aren’t useful. Most data quality vendors provide some or all of the following capabilities:
Data profiling: Profiling software performs statistical analysis of a data set’s values to identify potential errors and present them to business executives and users for review. A profiler can analyze the frequency distribution of values in a data column; it can also do cross-column analysis to look for inherent value dependencies and cross-table reviews to search for overlapping value sets. Data profiling tools provide a mechanism for concretely analyzing data quality across different dimensions, such as data completeness, validity and reasonableness.
Parsing: Slight variations in the representation of data values may lead to nonconformance with predefined formats. Parsing tools can determine whether a value conforms to a recognizable pattern as part of the data quality assessment, matching and cleansing process.
Standardization: Pattern-based parsing supports automated recognition and standardization of critical data elements – for example, the names, addresses and telephone numbers of customers. Standardization maps data variations to known patterns, thereby normalizing data values and effectively cleansing data errors. Words can be changed to abbreviations, or vice versa; common misspellings can be corrected; entries can be translated from one language to another. In addition, standardization can help to improve identity resolution, data deduplication procedures and master data consolidation.
Identity resolution and matching: Identity resolution is a process in which the degree of similarity between two data records is scored, most often through weighted approximate matching across a set of attribute values the records share. If the score is above a predefined threshold, the two records are deemed to be a match and are identified as most likely representing the same individual or entity.
Data enhancement: Tools supporting data enhancement add information from third-party data sets, such as household lists, name standardization data and demographic data reports, in an effort to further improve data quality. An example is address standardization and cleansing, which relies on parsing, standardization and the availability of third-party address information. Many industries benefit from data enhancement; for example, it can help reduce delivery costs, improve direct marketing response rates, boost customer profiling and segmentation efforts, and increase the accuracy of credit data used to assess and reduce financial risks.
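As a rough illustration of how several of the capabilities listed above fit together, the following Python sketch profiles a column, parses and standardizes phone numbers and addresses, and then scores two records with weighted approximate matching. It is a simplified sketch under stated assumptions, not how any commercial tool works: the phone pattern, the abbreviation table, the attribute weights, the 0.85 threshold and the sample records are all hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for whatever similarity measure a real product would use.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def profile_column(values):
    """Profiling: null count, distinct count and most frequent values of one column."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top": Counter(non_null).most_common(3),
    }


# Parsing: accept a few common layouts of a 10-digit phone number (assumed pattern).
PHONE_PATTERN = re.compile(r"^\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})$")


def standardize_phone(raw):
    """Return NNN-NNN-NNNN if the value conforms to the pattern, else None."""
    found = PHONE_PATTERN.match(raw.strip())
    return "-".join(found.groups()) if found else None


ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd"}  # illustrative only


def standardize_address(raw):
    """Standardization: lowercase, drop punctuation, map words to abbreviations."""
    tokens = re.sub(r"[^\w\s]", "", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def prepare(rec):
    """Standardize every attribute before matching."""
    return {
        "name": " ".join(rec["name"].lower().split()),
        "address": standardize_address(rec["address"]),
        "phone": standardize_phone(rec["phone"]) or "",
    }


WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}  # assumed weights, sum to 1
THRESHOLD = 0.85  # assumed match threshold


def match_score(rec_a, rec_b):
    """Identity resolution: weighted sum of per-attribute similarity ratios."""
    a, b = prepare(rec_a), prepare(rec_b)
    return sum(
        weight * SequenceMatcher(None, a[field], b[field]).ratio()
        for field, weight in WEIGHTS.items()
    )


def is_match(rec_a, rec_b):
    return match_score(rec_a, rec_b) >= THRESHOLD


# Hypothetical records: the first two describe the same person in different formats.
rec_1 = {"name": "Jon Smith", "address": "12 Main Street", "phone": "(212) 555-0147"}
rec_2 = {"name": "john smith", "address": "12 main st.", "phone": "212.555.0147"}
rec_3 = {"name": "Mary Jones", "address": "98 Oak Avenue", "phone": "415-555-0199"}
```

Calling `is_match(rec_1, rec_2)` and `is_match(rec_1, rec_3)` shows why standardization comes first: once formatting differences are removed, only the genuine variation in the name affects the score. Choosing the weights and the threshold remains a business decision that the tool cannot make on its own.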
But as mentioned above, data quality tools are only part of the answer to data quality issues. In order to make them an effective component of data quality improvement efforts, they need to be used to achieve specific business objectives under well-defined data quality management processes. That begins with the development of a metrics-based data quality business case and the data quality assessment, which typically involves the use of data profiling tools. A data quality program can then be extended to include processes such as data validation, remediation and monitoring as well as the ongoing management of data quality rules.
A data quality assessment should provide the justification that IT managers and data quality teams need to get approval and budget money for buying data quality tools. But devising a realistic and clear-eyed plan for integrating the tools into the data quality process will enable more effective use of the data quality software and avoid creating the impression that the technology itself will magically fix your data quality problems.
About the author: David Loshin is the president of Knowledge Integrity, Inc., a consulting company focusing on customized information management solutions in technology areas including data quality, business intelligence, metadata and data standards management. Loshin writes for numerous publications, including both SearchDataManagement.com and SearchBusinessAnalytics.com. He also develops and teaches courses for The Data Warehousing Institute and other organizations, and he is a regular speaker at industry events. In addition, he is the author of Enterprise Knowledge Management – The Data Quality Approach and Business Intelligence: The Savvy Manager's Guide. Loshin can be reached via his website: knowledge-integrity.com.