This article originally appeared on the BeyeNETWORK.
The process for consolidating name data requires identity resolution – the ability to assess the degree of similarity between two records and determine whether they meet or exceed the threshold for presumption of a match. Before the identity resolution step, though, the data analyst can usually improve the matching by standardizing the data values used as matching criteria across all the records into a common format. This standardization aligns the data values in a way that simplifies the matching algorithms.
A straightforward example using person data involves a matching algorithm that looks for exact matches in the last name field. Searching for duplicates within the set of records would require every record to be compared to every other record. A standardization that could be applied to this data set would be sorting the records by last name, thereby enforcing an ordering such that only records that are sorted into the same “neighborhood” in the lexicographic ordering need to be compared (that is, no name that starts with an “A” would match a name that begins with an “S”). Sorting is one typical standardization applied to a data set to simplify the data consolidation process.
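The effect of sorting on the number of comparisons can be sketched in a few lines of Python. This is only an illustration, not a production matcher: the record layout (dictionaries with "first" and "last" keys) and the function name candidate_pairs are assumptions made for the example, and the last names are upper-cased before comparison, which is slightly looser than a strict exact match.

```python
from itertools import groupby

def candidate_pairs(records):
    """Yield only the pairs of records that share a last name.

    Sorting by last name puts potential duplicates next to each other,
    so each record is compared only within its own "neighborhood".
    """
    # Assumption: case and surrounding spaces are ignored when comparing.
    key = lambda r: r["last"].strip().upper()
    ordered = sorted(records, key=key)
    for _, group in groupby(ordered, key=key):
        block = list(group)
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                yield block[i], block[j]

# Hypothetical sample data.
records = [
    {"first": "John", "last": "Smith"},
    {"first": "Ann", "last": "Adams"},
    {"first": "Jon", "last": "smith"},
    {"first": "Sam", "last": "Stone"},
]
pairs = list(candidate_pairs(records))
```

With four records, an all-against-all search would examine six pairs; here only the two Smith records are handed to the matching step.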
However, there are many different kinds of standardizations that can be applied to data, especially individual data. A recurring issue is the need to make sense of semi-structured name data, and name formats used within the United States make a good example. Typical name structures contain different components, the minimum usually being a given name and a surname. We can abstract that concept into two name tokens that appear within an individual name, which often show up in databases under some variant of these typical attribute monikers: “first name” and “last name.”
The continually evolving practice of child naming leads to the introduction of additional given names, often ordered in the context of use or importance. This means that a name might contain more than one given or first name. The assumption that everyone has one first given name and a second given name leads to that second given name often being referred to as a “middle name,” which is yet another frequent attribute name appearing in many databases. Yet middle name, by virtue of its not being a surname, is in fact just an additional first name, but it is probably not the preferred name by which the individual wants to be called.
With me so far? Okay, good. Let’s make it easier for ourselves and refer to these name components using token names. We’ll call a first name a first and a surname a last. We can call a middle name a middle, but since it is just another first name, let’s just refer to it (also) as a first. So now we have a few patterns for names:
- first last
- first first last
Some people have more than one middle name, which means that, in fact, we might need additional formats, such as:
- first first first last
How about those people who only use initials? We will need a new token: initial; and that provides us with some more patterns that are derivatives of our first set of patterns:
- initial last
- first initial last
- initial first last
- initial initial last
- first initial initial last
- first first initial last
- first initial first last
- initial first first last
- initial initial first last
- initial first initial last
- initial initial initial last
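The patterns listed so far can be generated mechanically, and a name string can be labeled with the pattern it follows. The sketch below is a simplification under stated assumptions: the function names are hypothetical, the final token is always taken to be the last name, and a single letter (with or without a period) is taken to be an initial.

```python
from itertools import product

def given_name_patterns(max_given=3):
    """Enumerate patterns of one to max_given given-name slots plus a last name."""
    patterns = []
    for n in range(1, max_given + 1):
        for combo in product(("first", "initial"), repeat=n):
            patterns.append(" ".join(combo + ("last",)))
    return patterns

def name_pattern(name):
    """Label each token of a name string as first, initial, or last."""
    tokens = name.split()
    if not tokens:
        return ""
    labels = []
    for tok in tokens[:-1]:
        bare = tok.rstrip(".")
        # Assumption: a single character is an initial, anything longer a first name.
        labels.append("initial" if len(bare) == 1 else "first")
    labels.append("last")  # Assumption: the surname always comes last.
    return " ".join(labels)

patterns = given_name_patterns()
```

With up to three given-name slots this yields 14 patterns: the three all-first patterns plus the 11 initial-bearing variants listed above. A string such as “John Q. Public” is labeled as first initial last.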
Actually, sometimes people have more than one last name, either hyphenated or not, so we could add onto that pattern list a bunch of versions with multiple last names. In fact, we could go on for a while with these ideas, but let’s introduce some more complexity. Sometimes a person gives his child the same name he has; those individuals are distinguished by an addition to the name, such as “Jr.” or “Sr.” When there are many generations of living people with the same name, they start to use numerals for distinction: “II,” “III,” “IV,” etc. We can give these tokens a name as well: generational. So now we can take each of our name patterns and insert a generational somewhere within the pattern – realize that the generational doesn’t always appear at the end!
I will spare you from having to read through the myriad pattern variants that include a generational token, because I am going to introduce some new tokens:
- title – this includes general titles used as part of etiquette (“Mr.,” “Mrs.,” etc.), as well as honorific titles (examples: “Doctor,” “Professor,” “Sir”) and attained titles (“General,” “The Honorable,” “The Most Reverend,” etc.)
- prefix – this includes name components that might not add significance to the matching process (“Da,” “Di,” “Von,” etc.).
- suffix – additional name strings that are not part of a name but may be associated with attainment (“PHD,” “ESQ,” “RN,” etc.)
There is now a list of seven name token types (last, first, initial, generational, title, suffix, prefix); and while there might be an expectation of an ordering of these tokens within a name string, it turns out that there is wide variation in the ways that these components appear within a string. Sometimes a record will have separate attributes for each one, but other situations will have the entire name string within a single data element. A name standardizer must, therefore, be able to recognize the different kinds of value strings that may be classified as one of these name token category types, and then reorder the name components into a normalized format that is then suitable to the subsequent matching phases.
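A toy version of such a standardizer might look like the following. It is a minimal sketch, not the pattern set described in this article: the lookup tables hold only a handful of the values quoted above, the function name standardize is hypothetical, multi-word titles such as “The Honorable” are not handled, and the last unclassified word is assumed to be the surname.

```python
# Tiny illustrative lookup tables; a real standardizer would need far larger ones.
TITLES = {"MR", "MRS", "MS", "DR", "DOCTOR", "PROFESSOR", "SIR", "GENERAL"}
PREFIXES = {"DA", "DI", "VON"}
SUFFIXES = {"PHD", "ESQ", "RN"}
GENERATIONALS = {"JR", "SR", "II", "III", "IV"}

# The normalized ordering of the seven token types (an assumed convention).
ORDER = ["title", "first", "initial", "prefix", "last", "generational", "suffix"]

def standardize(name):
    """Classify each token and return the components in a fixed order."""
    parsed = {kind: [] for kind in ORDER}
    name_words = []
    for raw in name.split():
        key = raw.strip(".,").replace(".", "").upper()
        if not key:
            continue
        if key in TITLES:
            parsed["title"].append(key)
        elif key in PREFIXES:
            parsed["prefix"].append(key)
        elif key in SUFFIXES:
            parsed["suffix"].append(key)
        elif key in GENERATIONALS:
            parsed["generational"].append(key)
        elif len(key) == 1:
            parsed["initial"].append(key)
        else:
            name_words.append(key)
    # Assumption: the final unclassified word is the surname, the rest are given names.
    if name_words:
        parsed["first"] = name_words[:-1]
        parsed["last"] = name_words[-1:]
    return parsed

result = standardize("Dr. John Q. Public Jr., PhD")
```

Because each token is classified by its value rather than its position, a generational is recognized wherever it appears in the string. The output is a dictionary with one slot per token type, which gives the subsequent matching phases a uniform layout to compare.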
Actually, that is not all. Sometimes a name string contains more than one individual – if you have ever looked at bank account information, you will see multiple names associated with a single account, such as “Mr. and Mrs. John Smith.” The name standardizer must now also recognize conjunctions within name strings and then differentiate the two components – “Mr. John Smith” and “Mrs. John Smith.” We now need patterns that include conjunctions as well as the ability to standardize more than a single name from a single record.
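The conjunction case can be sketched as follows. This handles only the single form quoted above, in which the first individual is represented by a title alone and shares the rest of the name with the second; the function names and the short title list are assumptions for the example, and other joint forms would need additional patterns.

```python
import re

# Assumed, deliberately short title list.
TITLES = {"MR", "MRS", "MS", "DR"}

def is_title(token):
    return token.strip(".,").upper() in TITLES

def split_joint_name(name):
    """Split a string such as 'Mr. and Mrs. John Smith' into two names."""
    parts = re.split(r"\s+(?:and|&)\s+", name.strip(), maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 1:
        return [parts[0]]  # no conjunction found
    left, right = parts[0].split(), parts[1].split()
    # Everything after the second individual's leading titles is the shared name.
    i = 0
    while i < len(right) and is_title(right[i]):
        i += 1
    shared = right[i:]
    # Only a title-only left side borrows the shared name in this sketch.
    if left and all(is_title(tok) for tok in left):
        left = left + shared
    return [" ".join(left), " ".join(right)]

names = split_joint_name("Mr. and Mrs. John Smith")
```

Each resulting string can then be passed through the single-name standardization on its own.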
What starts out seeming relatively simple ends up being quite complex, and we haven’t even gotten to the identity resolution part yet. Name standardization involves the ability to recognize the many different name components in many different formats and patterns, and then the ability to extract the corresponding strings and rearrange them into a format suitable for the subsequent matching phases. In fact, one of our current projects employs more than 850 different patterns, and that is solely for individual names – corporate or organizational names are not even touched on in this pattern set. And remember, we started out looking only at name formats common within the United States – various cultures around the globe have different protocols and expectations for names. In some countries, the surname comes first, followed by the given name; in others, there is only a single name, composed of a given name and a family name. It is this kind of complexity associated with the simple task of naming that provides the first (of many) challenges associated with individual data record integration.
David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of The Practitioner's Guide to Data Quality Improvement, Master Data Management, Enterprise Knowledge Management: The Data Quality Approach and Business Intelligence: The Savvy Manager's Guide. He is a frequent speaker on maximizing the value of information. David can be reached at firstname.lastname@example.org or at (301) 754-6350.