Automated record matching is a critical function of customer data integration (CDI) tools -- but this secret sauce of business rules and complex algorithms can be difficult to evaluate.
That's because matching customer data across heterogeneous sources is a difficult problem to solve. "Matching" in CDI is used as a catch-all term for the process of comparing like records, eliminating duplicates, and combining them into the best version of a record, according to Jill Dyche and Evan Levy, co-authors of Customer Data Integration: Reaching a Single Version of the Truth (Wiley 2006) and partners and co-founders of Sherman Oaks, Calif.-based Baseline Consulting Inc.
"With CDI, name and address matching is really only the tip of the iceberg," Dyche said.
CDI matching engines look at a broad variety of attributes and use complex processes to match and link customer records. Not all matching engines are created equal, though. When evaluating CDI tools, here's how to assess matching engines.
Understand matching methods
Matching is a complex subject, but two types of matching methods are generally used by customer data integration systems, Dyche and Levy wrote in their recent CDI book.
- Deterministic matching compares data records using business rules and simple algorithms that catch such things as simple typos or phonetic differences like "f" versus "ph." So a deterministic system could be told that two records with different names are the same person if the Social Security number and address fields match. "Deterministic matching works best when comparing records where an exact match is anticipated," Dyche and Levy wrote.
- Probabilistic matching uses statistical algorithms and applies more fuzzy logic to matching records. These systems might be able to determine links between records with more complex typographical errors and might learn common error patterns. They assign a probability of a match expressed as a percentage -- such as an 80% probability of a match. Companies can generally set their own percentage thresholds for records that should be automatically matched and ones that should be reviewed. "This approach is usually preferred for large data sets," Dyche and Levy wrote, "or when many attributes are involved in matching."
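To make the contrast concrete, here is a minimal Python sketch of both styles. It is an illustration only, not how any particular CDI product works: the record fields, the weights and the 0.90/0.70 thresholds are all assumptions, and the weighted string-similarity score is a simplified stand-in for the statistical models that real probabilistic engines use.

```python
from difflib import SequenceMatcher

# Hypothetical customer records; the field names are assumptions for illustration.
rec_a = {"name": "Stephen Philips", "ssn": "123-45-6789",
         "address": "12 Oak St", "phone": "555-0101"}
rec_b = {"name": "Steven Filips", "ssn": "123-45-6789",
         "address": "12 Oak St", "phone": "555-0101"}


def deterministic_match(r1, r2):
    """Business rule: same person if SSN and address match exactly,
    even when the name fields differ."""
    return r1["ssn"] == r2["ssn"] and r1["address"] == r2["address"]


# Assumed per-attribute weights (they sum to 1.0). A real engine would
# derive weights from the data rather than hard-code them.
WEIGHTS = {"name": 0.4, "address": 0.3, "phone": 0.3}


def similarity(x, y):
    """Fuzzy string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def match_score(r1, r2):
    """Weighted average of per-attribute similarities, between 0.0 and 1.0."""
    return sum(w * similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())


def classify(score, auto_threshold=0.90, review_threshold=0.70):
    """Apply company-defined thresholds: auto-match, steward review, or no match."""
    if score >= auto_threshold:
        return "match"
    if score >= review_threshold:
        return "review"
    return "no match"


if __name__ == "__main__":
    print("deterministic:", deterministic_match(rec_a, rec_b))
    score = match_score(rec_a, rec_b)
    print("score: %.0f%% -> %s" % (score * 100, classify(score)))
```

Raising `auto_threshold` sends more borderline pairs to human review; lowering it links more records automatically at the risk of false matches.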
But it's not a matter of picking one or the other -- or one being more accurate than the other. The matching style chosen for CDI has to do with an organization's goals, according to Scott Gidley, co-founder and chief technology officer for Cary, N.C.-based DataFlux Corp., a data quality and CDI tool vendor.
"It's more about processing performance -- how fast is something going to run based on how much value it's going to give you," Gidley said.
Deterministic matching is more cut and dried, and faster for processes such as real-time CDI hub lookups, Gidley said. But probabilistic matching allows for a higher level of variance between records and might catch more potential matches. Most companies that provide matching engines incorporate some variation of both options in their methodologies, he said.
And the end result of different CDI vendors' matching engines is often very similar, though it's a stretch to say that one algorithm will work for all companies, Levy said. The tool, and the matching engine, should be tested and evaluated against a company's unique requirements.
Here are some other things to keep in mind when assessing matching engines:
- Test with lots of real data; compare and benchmark results. Organizations testing CDI hubs should test with as much data as possible to get a more real-world test of the system, Levy said. DataFlux's Gidley also recommends profiling data sources to assess the accuracy and completeness of data. This can help with tuning the matching engine's business rules. For example, if 50% of the phone number fields in a database are empty, phone number might not be a good matching attribute.
- Consider the interface. When a system isn't sure whether records match, it generally refers the matter to a human data steward to make the call. So the interface for data stewards is an important part of a tool decision, Gidley said.
- Think globally and futuristically. Language differences, industry-specific requirements, and future infrastructure plans -- such as moving to a service-oriented architecture -- can also affect matching engine choices, according to Dyche.
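The profiling advice in the first tip above can be sketched in a few lines of Python. This is a hypothetical example, not a feature of any vendor's tool: the sample records, the field names and the 80% minimum fill rate are assumptions chosen for illustration.

```python
def fill_rates(records, fields):
    """Return the fraction of records with a non-empty value for each field."""
    total = len(records)
    rates = {}
    for field in fields:
        # Treat missing keys, None and blank strings as empty.
        filled = sum(1 for r in records if str(r.get(field) or "").strip())
        rates[field] = filled / total if total else 0.0
    return rates


def usable_match_fields(records, fields, min_fill=0.8):
    """Keep only fields populated often enough to be useful for matching."""
    rates = fill_rates(records, fields)
    return [f for f in fields if rates[f] >= min_fill]


# Hypothetical sample data: phone is empty in half of the records.
records = [
    {"name": "Ann Lee", "phone": "555-0101", "email": "ann@example.com"},
    {"name": "Bob Ray", "phone": "", "email": "bob@example.com"},
    {"name": "Cy Poe", "phone": None, "email": "cy@example.com"},
    {"name": "Di Fox", "phone": "555-0104", "email": ""},
]

if __name__ == "__main__":
    fields = ["name", "phone", "email"]
    print(fill_rates(records, fields))
    print(usable_match_fields(records, fields))
```

Completeness is only one part of profiling. A field can be fully populated and still be inaccurate, so a check like this complements, rather than replaces, testing the matching engine against real data.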
Overall, Dyche said, the matching engine should be only a part of the overall CDI tool choice.
"Matching is one of many decisions to make," Dyche said. "When we see these vendor bake-offs and companies get down to [which CDI vendor] has the most accurate match, that's still only one component of the decision."
So, in some cases, the CDI tool with the most accurate matching engine might not be the final choice. Companies must consider the big picture, including data volumes, processing speed and functional requirements, Dyche said.