Buyer’s Guide: Choosing data quality tools and software

All the reporting tools in the world mean nothing if you’re using dirty data. Learn how to select the right data quality tools and management software for your organization.

In this buyer's guide, learn how to weed through the various data quality tool vendors and get tips on selecting the right data quality management software for your business. Also, read about topics such as the pros and cons of open source data quality software. This opening section provides a detailed overview of the issues to consider in evaluating and choosing data quality tools.



As more business analysts recognize the relationship between high quality data and the success of the business, there is a growing interest in integrating data quality management within the organization. And while the lion’s share of the effort involves putting good data management practices in place and establishing data governance, there will always be a requirement for technology to support data quality maturity.

That being said, until relatively recently, most people equated the phrases “data quality” and “data cleansing” with the expectation that data quality tools were intended only to help identify data errors and then correct those errors. In reality, there are many techniques applied within the context of a data quality management program, with different types of tools used to support those techniques.

Data quality management incorporates a “virtuous cycle,” shown in FIGURE 1, essentially consisting of two phases: analysis and assessment, followed by monitoring and improvement.

Data quality tools and technology are necessary to support both the analysis and assessment phases. Data profiling tools are used for data analysis and identification of potential anomalies, while parsing and standardization tools are employed for recognizing errors, normalizing representations and values, and some degree of data correction. These tools can be used to define data quality rules that assert the validity of data, flag non-conformant values and aid the correction process.

A commonly used technology for customer and business name correction is identity resolution, which helps in linking and resolving variant representations of the same entities. After normalization and identity resolution have been performed, data enhancements such as address standardization and geocoding are applied. Lastly, the data quality rules can be integrated within a data quality auditing tool that measures compliance with defined data quality expectations. The results of these measurements can be fed into a data quality metrics scorecard, and if the metrics are defined in relation to the business impacts incurred by violating the expectations, that scorecard will provide an accurate gauge of how improving data quality goes straight to the bottom line.

Data profiling
The initial attempts to evaluate data quality are processes of analysis and discovery, and this analysis is predicated on an objective review of the actual data values. The values populating the data sets under review are assessed through quantitative measures and analyst review. While a data analyst may not necessarily be able to pinpoint all instances of flawed data, the ability to document situations where there may be anomalies provides a means to communicate these instances to subject matter experts whose business knowledge can confirm the existence of data problems.

Data profiling is a set of algorithms for statistical analysis and assessment of the quality of data values within a data set, as well as exploring relationships that exist between value collections both within and across data sets. For each column in a table, a data profiling tool will provide a frequency distribution of the different values, providing insight into the type and use of each column. Cross-column analysis can expose embedded value dependencies, while inter-table analysis explores overlapping value sets that may represent foreign key relationships between entities, and it is in this way that profiling can be used for anomaly analysis and assessment.

The data profiling process often sheds light on business rules inherent to each business process’s use of the data. These rules can be documented and used during the auditing and monitoring activity to measure validity of data.
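
To make the idea concrete, the following minimal Python sketch computes the kind of column-level profile described above: a frequency distribution plus null, distinct and basic numeric statistics. The function name and the sample values are illustrative assumptions, not the output or interface of any particular profiling product.

```python
from collections import Counter

def profile_column(values):
    """Minimal column profile: frequency distribution plus basic statistics."""
    non_null = [v for v in values if v is not None and v != ""]
    freq = Counter(non_null)
    profile = {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(freq),
        "most_common": freq.most_common(5),
    }
    # Numeric statistics apply only when every non-null value parses as a number
    try:
        nums = [float(v) for v in non_null]
        profile.update(min=min(nums), max=max(nums), mean=sum(nums) / len(nums))
    except ValueError:
        pass
    return profile

# Example: profile a column of order quantities with one null and one outlier
print(profile_column(["1", "2", "2", None, "999"]))
```

Commercial profiling tools perform these same kinds of calculations at scale and layer cross-column and cross-table analysis on top of them.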

Data parsing and standardization
In any data set, slight variations in representation of data values easily lead to situations of confusion or ambiguity for both individuals and other applications. For example, consider these different character strings:

  • 301-754-6350
  • (301) 754-6350
  • 301.754.6350
  • 866-BIZRULE

All of these formats use digits, some have alphabetic characters, and all use different special characters for separation, but to the human eye they are all recognized as reasonable telephone number formats. To determine whether these numbers are accurate or to investigate whether duplicate telephone numbers exist, the values must be parsed into their component segments (area code, exchange and line) and then transformed into a standard format.

When analysts are able to describe the different component and format patterns used to represent a data object (person's name, product description, etc.), data quality tools can parse data values that conform to any of those patterns and even transform them into a single, standardized form that feeds the assessment, matching and cleansing processes. Pattern-based parsing can automate the recognition and subsequent standardization of meaningful value components.

In general, parsing uses defined patterns managed within a rules engine used to distinguish between valid and invalid data values. When patterns are recognized, other rules and actions can be triggered, either to standardize the representation (presuming a valid representation) or to correct the values (should known errors be identified).
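
As a rough illustration of pattern-based parsing and standardization, the sketch below (a minimal example, not any vendor's implementation) parses the telephone number formats listed earlier into their component segments (area code, exchange and line) and emits a single standard form. The regular expression, the keypad mapping used for vanity numbers such as 866-BIZRULE and the NNN-NNN-NNNN target format are assumptions made for this example.

```python
import re

# Keypad mapping for vanity numbers such as 866-BIZRULE (assumes a US keypad)
KEYPAD = {c: d for d, letters in
          {"2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
           "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ"}.items()
          for c in letters}

PHONE_PATTERN = re.compile(r"^\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})$")

def standardize_phone(raw):
    """Parse a US phone number into (area code, exchange, line) components
    and return it in a single NNN-NNN-NNNN standard form, or None if it
    does not match a known pattern."""
    # Translate vanity letters to digits before applying the pattern
    digits = "".join(KEYPAD.get(c.upper(), c) for c in raw)
    match = PHONE_PATTERN.match(digits.strip())
    if not match:
        return None
    area, exchange, line = match.groups()
    return f"{area}-{exchange}-{line}"

for raw in ["301-754-6350", "(301) 754-6350", "301.754.6350", "866-BIZRULE"]:
    print(raw, "->", standardize_phone(raw))
```

In a full tool suite, patterns like this would be managed within the rules engine so that recognizing a pattern can also trigger standardization or correction actions.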

Identity resolution: Similarity, linkage and matching

A common requirement for data quality management involves two sides of the same coin: multiple data instances that actually refer to the same real-world entity, and the perception that a record does not exist for a real-world entity when in fact it really does. Both of these problems indicate the need for techniques that identify approximate matches by determining the similarity between different records. In the first situation, similar yet slightly variant representations in data values may have been inadvertently introduced into the system; in the second, a slight variation in representation prevents the identification of an exact match for the existing record in the data set.

More on data quality tools

Find out the results of the Gartner Magic Quadrant for data quality tools

Learn why business processes can be more important to the data quality process

Read about the need for improved business data quality measures

Both of these issues are addressed through a process called identity resolution, in which the degree of similarity between any two records is scored, most often based on weighted approximate matching across a set of attribute values in the two records. If the score is above a specified threshold, the two records are deemed a match and are presented to the end client as most likely representing the same entity. Identity resolution is used to recognize when slight variations suggest that different records are connected and where values may be cleansed.

Attempting to compare each record against all the others to provide a similarity score is not only ambitious but also time-consuming and computationally intensive. Most data quality tool suites use advanced blocking algorithms to group the records most likely to match into smaller sets, whereupon different approaches are taken to measure similarity. In addition, there are different approaches to matching: a deterministic approach relies on a broad knowledge base for matching, while probabilistic approaches employ statistical techniques to contribute to the similarity scoring process. Identifying similar records within the same data set suggests that the records are likely duplicates and may be subjected to cleansing and/or elimination. Identifying similar records in different sets may indicate a link across the data sets, which helps facilitate cleansing, knowledge discovery, reverse engineering and master data aggregation.
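
The following Python sketch illustrates those mechanics: records are first blocked on a shared key (here, the ZIP code) so that not every pair must be compared, and the surviving pairs are scored with weighted approximate matching (using the standard library's SequenceMatcher as a simple similarity measure) and flagged as likely matches when the score crosses a threshold. The sample records, the field weights and the 0.85 threshold are illustrative assumptions, not recommended settings.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records; weights and threshold are illustrative only
RECORDS = [
    {"id": 1, "name": "Jonathan Smith", "city": "Silver Spring", "zip": "20904"},
    {"id": 2, "name": "Jon Smith",      "city": "Silver Spring", "zip": "20904"},
    {"id": 3, "name": "J. Smyth",       "city": "Baltimore",     "zip": "21201"},
]
WEIGHTS = {"name": 0.6, "city": 0.25, "zip": 0.15}
THRESHOLD = 0.85

def similarity(a, b):
    """Weighted approximate-match score between two records."""
    return sum(w * SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
               for f, w in WEIGHTS.items())

# Blocking: only compare records that share a blocking key (here, ZIP code),
# which avoids scoring every record against every other record
blocks = {}
for rec in RECORDS:
    blocks.setdefault(rec["zip"], []).append(rec)

for block in blocks.values():
    for a, b in combinations(block, 2):
        score = similarity(a, b)
        if score >= THRESHOLD:
            print(f"Likely match: records {a['id']} and {b['id']} (score {score:.2f})")
```

Commercial matching engines replace the simple string-similarity measure used here with deterministic or probabilistic techniques tuned to the data domain.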

Data cleansing and enhancement
Data cleansing incorporates techniques such as data imputation, address correction, elimination of extraneous data, and duplicate elimination, as well as pattern-based transformations. Data cleansing complements (and relies on) parsing and standardization as well as identity resolution and record linkage. Data enhancement is a data improvement process that relies on record linkage, along with value-added improvement from third-party data sets (such as address correction, geo-demographic/psychographic imports, list appends). This is often performed by partnering with data providers, using their aggregated data as a “source of truth” against which records are matched and then enhanced.
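
As a simple sketch of cleansing followed by enhancement, the example below applies a pattern-based transformation (trimming a ZIP+4 value to a five-digit ZIP code) and then imputes missing fields and appends geocodes by matching against a hypothetical third-party reference set. The reference table, field names and coordinate values are assumptions made purely for illustration.

```python
# Hypothetical third-party reference data keyed by five-digit ZIP code; the
# field names and coordinate values are illustrative assumptions only
REFERENCE = {
    "20904": {"city": "Silver Spring", "state": "MD", "lat": 39.07, "lon": -76.98},
}

def cleanse_and_enhance(record):
    """Apply a simple pattern-based cleansing step, then enhance the record
    from a reference ('source of truth') data set matched on ZIP code."""
    cleaned = dict(record)
    # Pattern-based transformation: reduce a ZIP+4 value to the five-digit ZIP
    cleaned["zip"] = cleaned.get("zip", "").split("-")[0].strip()
    # Enhancement: fill in missing city/state and append geocodes when matched
    match = REFERENCE.get(cleaned["zip"])
    if match:
        cleaned.setdefault("city", match["city"])
        cleaned.setdefault("state", match["state"])
        cleaned["latitude"], cleaned["longitude"] = match["lat"], match["lon"]
    return cleaned

print(cleanse_and_enhance({"name": "Jon Smith", "zip": "20904-1234"}))
```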

Data auditing and monitoring
The same types of data quality rules exposed through conversations with subject matter experts and profiling can be used to describe end-user data quality expectations. Monitoring defined data quality rules and auditing the results provides a proactive assessment of compliance with expectations, and the results of these audits can feed data quality metrics populating management dashboards and scorecards.

Data profiling tools, as well as standalone auditing utilities, often provide capabilities for proactively validating data against a set of defined (or discovered) business rules. In this way, the analysts can distinguish those records that conform to defined data quality expectations from those that don’t, which in turn can contribute to baseline measurements and ongoing auditing for data quality reporting.
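
A minimal sketch of this kind of rule-based auditing appears below: each data quality rule is expressed as a named predicate, every record is validated against it, and the resulting compliance rates are the raw material for a metrics scorecard. The rules and sample records are illustrative assumptions.

```python
# Minimal rule-based audit: each rule is a name plus a predicate over a record;
# both the rules and the sample records are illustrative assumptions
RULES = {
    "zip_is_5_digits": lambda r: r.get("zip", "").isdigit() and len(r["zip"]) == 5,
    "name_is_present": lambda r: bool(r.get("name", "").strip()),
}

def audit(records):
    """Return the compliance rate for each rule, suitable for feeding a
    data quality metrics scorecard or dashboard."""
    scorecard = {}
    for rule_name, predicate in RULES.items():
        passed = sum(1 for record in records if predicate(record))
        scorecard[rule_name] = passed / len(records)
    return scorecard

sample = [
    {"name": "David Loshin", "zip": "20904"},
    {"name": "", "zip": "2090"},
]
print(audit(sample))  # e.g. {'zip_is_5_digits': 0.5, 'name_is_present': 0.5}
```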

What to look for in data quality tools and vendors
There are two interesting notions to keep in mind when considering data quality tools. First, every organization's needs are different, depending on the type of company, industry and business processes and their corresponding dependence on the use of high-quality data. Second, while the needs may differ, the ways they can be addressed are often very similar, although different vendors may address those issues with greater degrees of accuracy and precision (and, naturally, cost). Weighing both of these notions together, what will distinguish the suitability of one product over the others is more than just functionality, especially as data quality technology becomes more of a commodity capability.

Along with functionality, consider cost, installed base, vendor stability, training and support capabilities, as well as the pool of talent that can be tapped to help integrate the tools within a governed data quality program. In addition, because there have been many corporate acquisitions within the data quality market, consider whether a specific tool offering necessitates purchasing a full-blown suite of products. Alternatively, consider how comfortable you are purchasing one component of a vendor's tool suite with the expectation that it will integrate well with other tools already in use within the environment.

Business needs assessment and data quality tools requirements
The desire to acquire data quality tools should be tempered by the assessment process—too often, the technology is purchased long before the specific business needs have been determined. Therefore, it is worthwhile to perform a high-level data quality assessment with these specific objectives:

  • Identify business processes that are affected by data quality issues.
  • Identify the data elements that are critical to the successful execution of those business processes.
  • Evaluate the types of errors and data flaws that can occur.
  • Quantify business impacts associated with each of those errors.
  • Prioritize issues based on potential business impacts.
  • Consider the data quality improvements that can be applied to alleviate the business impacts.

While this process is presented simply above, there are many subtleties that may require additional expertise. To expedite this assessment, you may consider partnering with expert consultants who can perform the rapid assessment while simultaneously training your staff to replicate the process on other data sets.

General requirements for data quality tools
As a result of this process, companies should arrive at a prioritized list of improvements, which should frame the discussion of requirements analysis, both from the data quality standpoint and from the systemic and environmental aspects. For example, determining that duplicated customer records lead to business risks would suggest that duplicate elimination is advisable. This requires tools for determining when there are duplicates (identity resolution) and for cleansing (parsing and standardization, enhancement).

There are also degrees of precision, however. If your company is a mail-order vendor sending out duplicate catalogs, the risk is increased costs and lowered response, but 100% de-duplication may not be a requirement. But attempting to identify terrorists at the airport security gate may pose a significant risk in terms of passenger safety, necessitating much greater precision in identity resolution. Increased precision is likely to correspond to increased costs, and this is another consideration.

In terms of environmental and systemic aspects, one must consider how well different products can integrate within the organization's system architecture. Hard requirements such as platform compliance are relatively easy to specify. Architectural expectations are relevant, too: do the tools support real-time operations, do they execute only in batch, and can they be integrated “in-line”? In addition, as more organizations migrate toward services-oriented architectures (SOAs), determining whether the tools support services also becomes a requirement. From the business side, one must consider the license, support, training and ongoing maintenance costs, as well as the internal staffing requirements to manage and use the products.

Because many vendors provide tools that may (or may not) address the organization's issues, it is worthwhile to carefully delineate your business needs and technology requirements within a request for proposal (RFP). Issuing an RFP provides two clear benefits. First, it clarifies your needs to the vendors so they can more effectively determine whether their product will meet them. Second, it provides a framework for comparing vendor tools, scoring their relative suitability and narrowing the field. It is a good idea to include specific metrics associated with the quality of the data that can be used to compare and measure the effectiveness of the products.

Narrowing data quality vendors
Reviewing the RFP responses will help separate the vendors that make the grade from those whose products are not entirely appropriate to address the business needs. But to narrow the remaining vendors to a short list, set up meetings for the vendors to present their technology along with a proposal for how their products will be used to address the business needs. Again, it may be worthwhile to engage individuals with experience in data quality tools and techniques to clarify the distinctions among the vendor products, translate any “tech-talk” into terms that are understood, and ask the tough questions to ensure that the vendors are properly representing what their products can and cannot do.

By this time, your team should be able to whittle down the field to at most three competitors. The final test is to try out the tools yourself—arrange for the installation of an evaluation version of the product and run it over your own data sets. Having specified a benchmark data set for comparison, one can compare not just how well the products perform but also the ease of use and adoption by internal staff.

Specific requirements
The full details of what can be expected from the data quality tools described here are beyond the scope of this article, but the following list provides some high-level, “no-miss” capabilities for each type of tool.

Data profiling
  • Column value frequency analysis and related statistics (number of distinct values, null counts, maximum, minimum, mean, standard deviation)
  • Table structure analysis
  • Cross-table redundancy analysis
  • Data mapping analysis
  • Metadata capture
  • DDL generation
  • Business rule documentation and validation

Parsing and standardization
  • Flexible definition of patterns and rules for parsing
  • Flexible definition of rules for transformation
  • Knowledge base of known patterns
  • Ability to support multiple data concepts (individual, business, product, etc.)
  • Manageable transformation actions

Identity resolution
  • Entity identification
  • Record matching
  • Record linkage
  • Record merging and consolidation
  • Flexible definition of business rules
  • Knowledge base of rules and patterns
  • Integration with parsing and standardization tools
  • Advanced algorithms for deterministic or probabilistic matching

Cleansing and enhancement
  • Flexible definition of cleansing rules
  • Knowledge base of common patterns (for cleansing)
  • Knowledge base of enhancements (e.g., address cleansing, geocoding)

Auditing and monitoring
  • Data validation
  • Data controls
  • Services-oriented
  • Rule management
  • Rule-based monitoring
  • Reporting

 

Conclusion
Even though many of the more established data quality tool vendors have been acquired by even bigger fish, there are still companies emerging with better approaches to fill the void. Whether they offer better algorithms packaged in a different way, improved performance, better suitability to SOA or even an open source model, there is a wide range of vendors, products and tools to fit almost any organization's needs. Even armed with the knowledge of what you should look for in data quality tools, there is one last caveat: If your organization has an opportunity for data quality improvement, make sure that you have done your homework in business needs assessment and development of a reasonable RFP before evaluating and purchasing tools.


About the author: David Loshin, president of Knowledge Integrity Inc., is a recognized thought leader and expert consultant in the areas of data governance, data quality methods, tools and techniques, master data management, and business intelligence. David is a prolific author on BI best practices, both via his B-Eye Network expert channel and in numerous books on BI and data quality. His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at www.mdmbook.com. David can be reached at loshin@knowledge-integrity.com or 301-754-6350.

