Uniqueness, Identity, Nonidentifiability and Governance

An organization's policies must be fully integrated and deployed as part of the master data management program, not just as an add-on.

This article originally appeared on the BeyeNETWORK.

It should not be surprising that we build our information systems with the intention of ensuring the uniqueness of each core record. One main reason is that a good data model reflects the real-world entities it is intended to represent; therefore, there should not be any duplication of data within the system. In turn, you would expect that there is a specific way of referring to any specific record – a primary key that is used across the enterprise to uniquely distinguish the record representing any entity from all others.
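As a minimal sketch of that idea (using a hypothetical in-memory registry for illustration, not any particular MDM product), enforcing uniqueness at the point of insertion might look like this:

```python
class MasterRegistry:
    """Minimal sketch of a master reference set keyed by a unique ID.

    A hypothetical in-memory stand-in for an enterprise master table;
    the point is only that the primary key is checked before insertion.
    """

    def __init__(self):
        self._records = {}   # primary key -> full record

    def add(self, record_id, record):
        # Uniqueness: reject any attempt to register a second record
        # under an existing primary key.
        if record_id in self._records:
            raise ValueError(f"duplicate primary key: {record_id}")
        self._records[record_id] = record

    def get(self, record_id):
        # The primary key is the single, enterprise-wide way to refer
        # to the entity this record represents.
        return self._records[record_id]
```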

There are three aspects to this use of a primary key. We’ve already established the first, uniqueness, but the other two aspects expose subtle issues associated with any master reference set, especially when used to represent real people.

The second aspect is identity. While we maintain full records containing attributes associated with each individual entity, by assigning each record a unique identifier we are, in effect, encapsulating all that entity’s data into a single reference ID – quite a feat of compression! By assigning a unique reference identifier to an entity’s set of demographics, you might assume that for alternate analysis you could remove all “identifying information” in deference to the unique identifier, yet still benefit from the conclusions derived from the aggregate analysis.
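As a minimal sketch (assuming simple dictionary-shaped member records, and with the field names and the choice of which attributes count as "identifying" being assumptions for illustration), de-identifying a data set for aggregate analysis might look like this:

```python
import uuid

# Hypothetical set of identifying attributes; the real list is a
# governance decision, not a technical one.
IDENTIFYING_FIELDS = {"name", "ssn", "address", "phone"}

def pseudonymize(records):
    """Replace each record's identifying attributes with an opaque
    surrogate ID, keeping a private crosswalk for authorized re-linkage."""
    crosswalk = {}        # surrogate ID -> withheld identifying data
    deidentified = []
    for record in records:
        surrogate = str(uuid.uuid4())
        crosswalk[surrogate] = {k: record[k]
                                for k in IDENTIFYING_FIELDS & record.keys()}
        cleaned = {k: v for k, v in record.items()
                   if k not in IDENTIFYING_FIELDS}
        cleaned["member_id"] = surrogate
        deidentified.append(cleaned)
    return deidentified, crosswalk

members = [{"name": "Pat Smith", "ssn": "123-45-6789",
            "age": 42, "diagnosis_code": "E11"}]
analysis_set, _ = pseudonymize(members)
print(analysis_set)
# [{'age': 42, 'diagnosis_code': 'E11', 'member_id': '...'}]
```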

On the other hand, data governance policies sometimes conflict with each other, and privacy is one dimension of data management where those conflicts often arise. This brings us to the third aspect – nonidentifiability, which implies that, given a subset of the data within the record, it is not possible to determine the entity the record represents. Where nonidentifiability is desired, one must ensure that even the unique identifier itself does not carry any data that could be used to “reverse engineer” any information about the entity.

Let’s look at an example from the healthcare industry. Until recently, many health insurers used an individual’s assigned social security number as their “member identifier.” A combination of factors has made that practice obsolete: the rampant, unchecked use of the social security number (SSN) as an identifier across numerous domains; the privacy constraints imposed by the Health Insurance Portability and Accountability Act (commonly known as HIPAA); and the fact that social security numbers themselves carry information about the location of the individual to whom the SSN was assigned. Therefore, while an SSN has a relatively good chance of providing both uniqueness and identity, it fails the nonidentifiability criterion. With enough digging, one could track down an association between an SSN and a potential location, if not also a person’s name.
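A minimal sketch of the alternative, assuming nothing more than Python’s standard library: issue member identifiers that are random and opaque, so the identifier itself carries no information about the person.

```python
import secrets

def new_member_id() -> str:
    """Issue an opaque member identifier.

    A random 128-bit token satisfies uniqueness (with overwhelming
    probability) and serves as the identity key for the master record,
    yet encodes nothing about the person - no location, no birth date,
    no fragment of the SSN.
    """
    return secrets.token_hex(16)   # e.g. '9f86d081884c7d65...'

# By contrast, an identifier derived from the SSN - even a cryptographic
# hash of it - is not nonidentifiable: with only about a billion possible
# SSNs, the mapping can be reversed by brute force.
```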

Another interesting example is AOL’s release of about 20 million queries of “anonymized” search data for more than half a million individuals. Each individual was assigned a unique identifier, yet there was enough data embedded within the data set to enable identification of at least one individual (see http://www.computerworld.com/blogs/node/3180), and probably many more. The public outcry over privacy concerns forced AOL to remove the data. So in this case, the identifier may score well on the uniqueness and identity aspects, but it fails when it comes to nonidentifiability.

The challenges don’t stop there – in an operational environment, one must consider the data life cycle associated with the entities subject to the uniqueness requirement. A proactive approach to ensuring uniqueness forces all applications to determine whether a master record already exists for an entity before a new one is created. Yet in an environment where data transcription and entry errors perturb name strings as they go into application systems, determining whether a record already exists means considering all exact matches plus any matches whose degree of similarity exceeds a defined threshold.
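As a minimal sketch of that lookup-before-create step (using a simple edit-distance similarity from Python’s standard library and an illustrative threshold; production MDM matching relies on far more sophisticated probabilistic and phonetic techniques):

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85   # illustrative; tuning is domain-specific

def find_candidate_matches(incoming_name, master_names):
    """Return master names that exactly match, or whose similarity to
    the incoming name exceeds the threshold, best match first."""
    candidates = []
    for existing in master_names:
        score = SequenceMatcher(None,
                                incoming_name.lower(),
                                existing.lower()).ratio()
        if score >= SIMILARITY_THRESHOLD:
            candidates.append((existing, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

masters = ["Jonathan Q. Smith", "Jon Smith", "Joan Smyth"]
print(find_candidate_matches("Jonathon Q Smith", masters))
# Only near matches come back - a human must now review them, which is
# exactly where the privacy conflict described next arises.
```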

The issue arises when only similar (not exact) matches appear: those candidates must be presented to the end-client for review to determine whether one of the proposed variations really is a match. But in doing this, the system’s presentation of identifying information to the end-client may actually violate the governance constraints associated with privacy!

These simple examples highlight some of the potential intricacies hidden underneath a master data management (MDM) initiative. Governance will conflict with technology, so when assembling your MDM program, governance must be designed into the end product.

How is this done? Just as business objectives translate into business rules that are composed of data rules, policies should be similarly decomposable. In fact, take a few minutes to visit your favorite Web site and review its stated privacy policy, and you might be encouraged to see that the prose is just a wrapper around a collection of business rules. Here is a line from one organization’s privacy policy: “The personal information that is shared with third parties is limited to company name and job title.” This can easily be transformed into an assertion on shared data records that can be automatically enforced before any data is exchanged.
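As a minimal sketch (the field names and record shape are hypothetical), that one sentence of policy might be enforced as a projection applied to every outbound record:

```python
# Hypothetical encoding of the quoted policy line as a data rule:
# only these fields may leave the organization.
SHAREABLE_FIELDS = {"company_name", "job_title"}

def enforce_sharing_policy(record):
    """Project a record down to the attributes the privacy policy
    permits to be shared with third parties; all else is withheld."""
    return {k: v for k, v in record.items() if k in SHAREABLE_FIELDS}

record = {"name": "Pat Smith", "company_name": "Example Corp",
          "job_title": "Analyst", "email": "pat@example.com"}
print(enforce_sharing_policy(record))
# {'company_name': 'Example Corp', 'job_title': 'Analyst'}
```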

At the high level, consolidation and integration into a single repository sound like great ideas, and there is no end to the lip service senior managers pay to the concept of governance. The devil is in the details, though, when assessing how your organization’s policies must be fully integrated and deployed as part of the MDM program, not just as an add-on.

David Loshin
David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions, including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of The Practitioner's Guide to Data Quality Improvement, Master Data Management, Enterprise Knowledge Management: The Data Quality Approach and Business Intelligence: The Savvy Manager's Guide. He is a frequent speaker on maximizing the value of information. David can be reached at loshin@knowledge-integrity.com or at (301) 754-6350.
