This article originally appeared on the BeyeNETWORK.
It should not be surprising that we build our information systems with the intention of ensuring the uniqueness of each core record. One main reason is that a good data model reflects the real-world entities it is intended to represent; therefore, there should not be any duplication of data within the system. In turn, you could expect that there is a specific way of referring to any specific record – a primary key that is used across the enterprise to uniquely distinguish the record representing any entity from all others.
There are three aspects to this use of a primary key. We’ve already established the first, uniqueness, but the other two aspects expose subtle issues associated with any master reference set, especially when used to represent real people.
The second aspect is identity. While we maintain full records containing attributes associated with each individual entity, by assigning each record a unique identifier we are, in effect, encapsulating all that entity’s data into a single reference ID – quite a feat of compression! Having assigned a unique reference identifier to an entity’s set of demographics, you might assume that for alternative analyses you could remove all “identifying information” in favor of the unique identifier, yet still benefit from the conclusions derived from the aggregate analysis.
On the other hand, data governance policies sometimes conflict with each other, and privacy is one dimension of data management that is often involved. This brings us to the third aspect – nonidentifiability, which implies that given a subset of the data within the record, it is not possible to determine the entity the record represents. When nonidentifiability is required, one must ensure that even the unique identifier itself does not carry any data that could be used to “reverse engineer” any information about the entity.
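As a minimal sketch of an identifier that carries no data about the entity, the Python below draws each key from random bits rather than deriving it from any attribute. The names `new_member_id`, `register`, and `master_index` are hypothetical, chosen only for illustration, and the in-memory dictionary stands in for whatever master repository is actually used.

```python
import uuid

# Hypothetical in-memory stand-in for a master reference set.
master_index = {}

def new_member_id() -> str:
    # uuid4 is built from random bits, so the value is not derived
    # from the name, birth date, location, or any other attribute.
    return uuid.uuid4().hex

def register(demographics: dict) -> str:
    # Assign a fresh identifier and store the record under it.
    member_id = new_member_id()
    master_index[member_id] = demographics
    return member_id

member_id = register({"name": "Pat Example", "state": "MD"})
print(member_id)
```

Note that an identifier like this addresses only the key itself. As the examples that follow show, the data attached to the key can still identify the individual.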
Let’s look at an example in the healthcare industry. Up until recently, many health insurers typically used an individual’s assigned social security number as their “member identifier.” The combination of a number of factors has made that practice obsolete: the rampant, unchecked use of the social security number (SSN) as an identifier across numerous domains, the privacy constraints imposed by the Health Insurance Portability and Accountability Act (commonly known as HIPAA), and the fact that social security numbers themselves carry information associated with the location of the individual to whom the SSN was assigned. Therefore, while an SSN may have a relatively good chance of providing uniqueness, and its use defines identity, the SSN fails the nonidentifiability criterion. With enough digging, one could track down an association between an SSN and a potential location, if not also a person’s name.
Another interesting example is AOL’s release of about 20 million queries associated with “anonymized” search data for more than half a million individuals. Each individual was assigned a unique identifier, yet there was enough data embedded within the data set to enable identification of at least one individual (see http://www.computerworld.com/blogs/node/3180), and probably a lot more. The public outcry related to privacy concerns forced AOL to remove the data. So in this case, the identifier may score well with the uniqueness and identity aspects, but fails when it comes to nonidentifiability.
The challenges don’t stop there – in an operational environment, one must consider the data life cycle associated with the entities subjected to the uniqueness factor. A proactive approach to ensuring uniqueness forces all applications to determine whether a master record already exists for an entity before a new one is created. Yet in an environment where data transcription and entry errors perturb name strings as they go into application systems, determining whether a record already exists means considering all exact matches plus any matches whose degree of similarity exceeds a defined threshold.
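The lookup described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a prescribed matching method: it uses Python's standard `difflib.SequenceMatcher` as the similarity measure, and the function names, the sample names, and the 0.85 threshold are all hypothetical. A production MDM system would more likely compare several attributes with a purpose-built matching engine.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    # Trivial normalization so case and stray whitespace do not matter.
    return " ".join(name.lower().split())

def similarity(a: str, b: str) -> float:
    # Ratio between 0.0 and 1.0; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def find_candidates(name, master_names, threshold=0.85):
    # Return (exact matches, similar matches whose score exceeds the threshold).
    exact = [m for m in master_names if normalize(m) == normalize(name)]
    similar = []
    for m in master_names:
        if m in exact:
            continue
        score = similarity(name, m)
        if score > threshold:
            similar.append((m, score))
    # Best candidates first, ready for review.
    similar.sort(key=lambda pair: pair[1], reverse=True)
    return exact, similar

master = ["Jonathan Smith", "Jon Smyth", "Maria Garcia"]
exact, similar = find_candidates("Jonathon Smith", master)
print(exact, similar)
```

When `exact` is empty and `similar` is not, a new record cannot be created safely until someone decides whether one of the candidates is the same entity.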
The issue occurs when only similar matches appear, and those matches must be presented to the end-client to review to determine whether one of the proposed variations really is a match. But to do this, the system’s presentation of identifying information to the end-client may actually violate the governance constraints associated with privacy!
These simple examples highlight some of the potential intricacies hidden underneath a master data management (MDM) initiative. Governance will conflict with technology, so when assembling your MDM program, governance must be designed into the end product.
At a high level, consolidation and integration into a single repository sound like great ideas, and there is no end to the lip service senior managers pay to the concept of governance. The devil is in the details, though: your organization’s policies must be fully integrated and deployed as part of the MDM program, not just bolted on as an add-on.