Customer Data Integration (CDI) is a hot topic within Master Data Management (MDM). At its core, CDI relates customers in different systems thereby presenting a single view of the customer. CDI solves this integration dilemma by establishing customer identity. This article will describe the integration problem, walk through the steps and point out the challenges your likely to face when creating customer identity.
Describing the problem
Relating customers in multiple systems is certainly not a new problem. In fact, CDI became an issue for your company when customer data was stored in a second source system. Today, customer data may very well exist in dozens of siloed systems, each with its own version of the truth. Ah! Here lies the problem; a customer doesn't have a single system of record. So who has the official customer name, address or phone number? If you ask your system owners, they all do. Analysis is sure to tell us that the value, quality and age of this data are as varied as the number of ways it was originally captured. The "best" name may be found in one source system and the "best" phone number in another. Simply stated, we don't have a single view of the customer.
Choosing an integration model
So far, we've determined that the same customer data stored in different systems presents a consistency and fragmentation problem. To solve this, we'll need to integrate our customer data. We begin by choosing an integration model.
Figure 1: Point-to-point integration
There are two predominate integration models in this space, point-to-point and hub-and-spoke. Point-to-point integration (Figure 1) involves a direct connection between source systems. As you can imagine, this approach is difficult to manage and scale. On the other hand, a simpler hub-and-spoke model (Figure 2) implements a central integration point into which all source systems will link. If you follow the industry trend, you'll decide on a hub-and-spoke model for your integration.
Figure 2: Hub and spoke integration
Choosing a hub style
With our model decided, we move on to choosing either a registry, co-existence or transaction style hub. This decision is largely dependent on your business requirements. Which ever style you choose, establishing customer identity is essential and the first order of business.
More than likely, you will incrementally deliver CDI by first establishing customer identity in a registry style hub. This style is best described as a read-only integrated reference point in which customers are linked in the hub while their data is managed by each source system. Registry style hubs are commonly created in the analytic environment. Co-existence style hubs take integration a bit further by physically storing a copy of the master customer data in the hub. This style hub straddles the blurred line between analytical and operational environments. In transaction style hubs, CRUD (create, read, update and delete) activities are done directly to master customer data stored in the hub. The hub residing in the operational environment serves as a federated extension of the source system data stores.
In registry and co-existence hub styles, data latency will determine your synchronization scheme. Integration options include either pushing or pulling data with Enterprise Application Integration (EAI) or Extract, Transform and Load (ETL) processes. As you move closer to real-time integration you will undoubtedly need to leverage Change Data Capture (CDC) push technology into your strategy. CDC identifies, captures and delivers just changed data to your above mentioned integration processes.
Fortunately, your hub environment can migrate from analytical to operational as the hub evolves. With this approach we gain the initial ease of implementation in the analytical environment and the final operational efficiencies found in our transaction environment. If you plan to evolve your hub beyond a registry style, I recommend logically and physically placing your hub in front of your data warehouse as an E/R party model. This will provide the flexibility, extensibility and future ease of transition to your operational environment. If however your end goal is simply to establish identity, then consider incorporating your hub right into your existing data warehouse.
Who versus what is a customer
Now it's time to ask the right question? Let's start with, "Who is a customer?" Seems like a simple question, right? Try asking the stakeholders in your organization. Chances are, they won't agree on a single definition. Nor should you expect them too. Different business areas will describe a customer differently. Marketing, which probably includes prospects as customers, will never agree with Accounting which counts only clients who give you business. In fact, the number of ways to describe customers largely depends on the contexts in which their used. Too many times I've seen CDI initiatives stalled because there isn't consensus of "who" a customer is. So stop trying to water down a description of a customer that is so general it has little meaning and worse yet has no value to your company.
Instead, answer the question, "What is a customer?" A customer is best described by the relatively few attributes which uniquely identify the person, organization or role. These attributes (Figure 3) are referred to as personally identifiable information (PII). Remember, these attributes are plural. A customer can have several names etc. depending on their use. For instance, a business can have different legal, doing-business-as (DBA) and prior names as well as different addresses for mailing, billing and physical location. Spend your time identifying and profiling this data in your source systems. Ranking each source and attribute by their data quality will assist you later as you determine integration rules. This practical approach will get you started on the real problem of integration.
|Figure 3: Sample Personally Identifiable Data Elements|
Having identified the attributes which make each customer unique, it's finally time to integrate. So, how do we actually relate customers? The answer begins with global identification. Each customer in our integration hub is assigned a unique universal identifier. This global identifier becomes our surrogate primary key of our customer record. Our hub also becomes the system of record for personally identifiable information which defines "what" a customer is.
Source system customer keys are also carried as attributes on each customer record in our hub. Customer keys will serve as our bi-directional link between the hub and its source systems. This will save us from matching identity attributes each time a source system links to the hub or the hub to a source system. This customer record is the minimal design required for supporting customer identity.
Cleansing, matching and linking
The next step is to establish relationships between the hub and individual source systems (spokes). This problem space is more difficult than you might think. External or third-party data vendors provide cleansing, matching and linking services and should be your first choice. They have the data, expertise and computing power to succeed in this complex and underestimated problem space. If you're not convinced then try an internet search for "algorithms on strings" and appreciate the work done in this area. Strongly consider a "buy" versus "build" strategy for this important decision.
Regardless of your decision, you'll need to understand the sophistication behind these algorithms. Cleansing and matching identity attributes includes the application of data rules, business rules and statistical formulas. When combined with fuzzy logic these algorithms go from science to art. Expect each vendor to go about this differently.
Data rules are straight forward ranging from validation checks to standardization. For instance, an address validation would not only validate state and zip code attributes individually by verifying their values against a postal list but together by verifying that the zip code is valid for the particular state. Standardization on the other hand can involve correcting misspelled words to replacing common variations of dictionary words with a standard word.
After data rules have been applied, algorithms next attempt to match the cleansed data elements to an existing customer in the hub. Here is where we introduce business matching rules and things get tricky. Matching algorithms are numerous. Don't assume all matching is byte for byte pattern searches. Phonetic and statistical algorithms are frequently used to aid in the matching process. Aproximate or inexact matching better known as "fuzzy matching" is another technique. In many cases several complimentary matching strategies are employed. Typical matching algorithms start with a wide match, perhaps name and address, then move on to name and zip until a final phone number-only match is attempted. This can be further complicated by adding match percent criteria. For example, you may want to match only 40% of the name or address. In the end, match probability will depend on the match type, criteria and algorithms. Use and store this information to assign a confidence score and to track lineage. Matching rules may determine, with varying degrees of confidence, that all of the following customers (Figure 4) are the same.
|Name and Address||Match Type|
|Customer Hub|| Smith's Ultra Technology Inc.
10 Center Street
Mytown, CT 06410
|Source 1|| Smithes Tec & Servc Inc
Ten CNTR St.
|Name and address|
|Source 2|| Smithe Ultra Tech
|Name and Zip|
|Source 3|| Smith Ultra &
P. O Box 999
If we get a match then we'll store the customer key as an attribute in the hub's customer record. If no match is found then we'll establish a new customer record with a global identifier, standardized identity attributes and the source system's customer key. Having linked our customer, either by matching an existing record or creating a new record we can now use the customer key as a bi-directional link between our source system and integration hub. This completes the process of establishing customer identity.
This article has introduced you to the decision making process for establishing customer identity within CDI architecture. I hope you've gained a degree of insight into what it takes to deliver a single view of the customer. Solving this age old integration problem for your company can mean the difference between competing and leading in your industry.
Jeff McCormick is a Senior Data Architect at a financial services company and Executive Director of the Connecticut Oracle User Group. Jeff has worked in IT for almost 20 years as a system, storage and database architect/administrator and has over 15 years of experience with DB2, Sybase, SQL Server and Oracle relational database technology. He holds several certifications including Oracle Database 10g Administrator Certified Professional, Microsoft Certified Product (SQL Server) and Certified Sybase Professional Database Administrator. Jeff has performed extensive work in the area of high availability and disaster recovery, speaking and authoring several papers on availability architecture.