This article originally appeared on the BeyeNETWORK.
I was obsessed with codes as a child. These codes generally referred to spy materials, like code books, ciphers, translations, etc. I remember reading about the difference between a cipher and a code. Whereas a cipher was a direct mapping of letters, a code provided a directory of words, each with its alternate meaning.
Conceptually, the difference between the two lies in the cardinality of what we could call the “message space.” The number of messages that can be sent within the constraints of a code is limited, because each word in the code corresponds directly to its encoded meaning. The number of messages that can be sent within the constraints of a cipher, by contrast, is almost limitless. This is because the formalism maps one set of symbols to another set of symbols, and the meaning of the message is derived from its final context (i.e., what the letters spell out). In a code, your taxonomy and vocabulary are fixed, and each code word maps to one specific encoded meaning.
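To make the distinction concrete, here is a minimal Python sketch. The three-letter shift and the code book entries are assumptions invented for illustration; they are not taken from any real cipher system or code book.

```python
import string

# Cipher: a direct letter-for-letter mapping. A shift of 3 is an arbitrary
# choice for illustration; any one-to-one letter mapping would do.
SHIFT = 3
ALPHABET = string.ascii_uppercase
CIPHER = {c: ALPHABET[(i + SHIFT) % 26] for i, c in enumerate(ALPHABET)}


def encipher(message: str) -> str:
    """Map each letter through the cipher; leave other characters alone."""
    return "".join(CIPHER.get(ch, ch) for ch in message.upper())


# Code: a fixed directory of phrases, each with one alternate meaning.
# These entries are hypothetical.
CODE_BOOK = {
    "ATTACK AT DAWN": "BLUEBIRD",
    "RETREAT": "SPARROW",
}


def encode(message: str) -> str:
    """Look the whole message up; anything outside the book cannot be sent."""
    return CODE_BOOK[message]  # raises KeyError for unlisted messages


print(encipher("Meet me"))   # any text can be enciphered
print(encode("RETREAT"))     # only listed messages can be encoded
```

The cipher accepts any string, while the code book rejects every message it does not list. That is the difference in message space described above.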
Besides the nostalgic value of reminiscing, this concept maps directly into master data management, especially as more people are interested in consolidating their various reference data sets into a single master repository. To motivate this discussion, I must first ask: What is the definition of a code in a data environment?
A code is a single value that represents a different single value. But why do we use them? There are two obvious reasons, at least in my mind. First, codes save space. We can condense long string values into a smaller representative mapped enumeration of smaller strings (or numbers). As a result, we can compress more information into a single data representation. Second, codes facilitate workflow. They do this by making it easy to write executable programs that perform different tasks depending on the code value.
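Both reasons can be seen in a small sketch. The one-letter order status codes, their meanings, and the follow-up actions below are hypothetical, made up purely for illustration.

```python
# Reason 1: space. A one-character code stands in for a long description.
# The codes and descriptions are hypothetical.
STATUS_MEANING = {
    "P": "Payment pending",
    "S": "Shipped to customer",
    "R": "Returned for refund",
}

# Reason 2: workflow. A program can branch directly on the code value.
# The actions are hypothetical as well.
STATUS_ACTION = {
    "P": "send payment reminder",
    "S": "send tracking notice",
    "R": "issue refund",
}


def next_action(status_code: str) -> str:
    """Choose the next task based only on the code value."""
    return STATUS_ACTION[status_code]


orders = [("A-100", "P"), ("A-101", "S"), ("A-102", "R")]
for order_id, code in orders:
    print(order_id, STATUS_MEANING[code], "->", next_action(code))
```

Each record stores one character rather than a full description, and the dispatch logic never has to parse free text.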
A simple example that demonstrates both reasons for using codes is a North American Numbering Plan telephone number. The number is broken up into three chunks: an area code, an exchange code and a line number. These three chunks encode information that directs the process of establishing a telephone connection. The process runs as follows:
- Find the switch for the area code.
- Find the switch at the central office for the exchange.
- Find the line number within that central office.
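The three chunks that drive these steps can be pulled apart with a few lines of Python. This is a rough sketch: the function and type names are my own, the sample number is fictitious, and the parser checks only the digit count, not the numbering plan's rules about which digits are valid in each position.

```python
import re
from typing import NamedTuple


class NanpNumber(NamedTuple):
    area_code: str  # selects the switch for the area
    exchange: str   # selects the central office switch
    line: str       # selects the line within that central office


def parse_nanp(raw: str) -> NanpNumber:
    """Split a ten-digit number into its three chunks.

    Assumption: punctuation and an optional leading country code "1"
    are tolerated; nothing else about the number is validated.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}")
    return NanpNumber(digits[:3], digits[3:6], digits[6:])


number = parse_nanp("(212) 555-0147")
print(number.area_code, number.exchange, number.line)
```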
Alternatively, one could view the phone number as a value that represents the party who can be reached by calling that number. Conversely, we can create a mapping between parties and telephone numbers, and put the (party, telephone number) pairs into a specially designed table. This practice was common before the internet became the source for all information. For example, many companies collected these data pairs, indexed the pairs alphabetically by party name and published the table in a physical handbook. This handbook could be used for looking up telephone numbers. Interestingly, it could also be used as a booster seat for toddlers!
The process of exploiting the mapping for additional purposes, however, can allow the introduction of flaws into your encoding semantics. Remember, each string has a single meaning in a code; here, we are beginning to use codes for more than one meaning, namely routing and identification. If you change the business environment, then you will suddenly have a recipe for disaster. For example, the original numbering plan embeds a geographical hierarchy associated with the location of the physical network, and it assumes the use of a network that links telephones connected physically by wires. Therefore, a telephone number with a 212 area code is expected to be attached to a phone jack in New York City.
Now consider the introduction of the mobile phone. While we use the same numbering plan to assign telephone numbers for the process of establishing a connection, the mobile phone is no longer physically tied to a location. Because of this, the assumed embedded geographical hierarchy no longer holds. This becomes problematic when the potential impact of making that assumption is significant. Suppose someone visiting Los Angeles with a New York City mobile phone calls 911. If the 911 system routes calls to the police station based on the area code, then the caller will be connected to an emergency response center approximately 3,000 miles away. Clearly, this is not likely to help the caller very much.
The assumption that could lead to this kind of problem is what I call “code overloading.” I suggested earlier that with a code, each particular code value maps to a single meaning. Code overloading happens when people ascribe additional meanings to codes already in use. In our telephone number example, the meaning of the area code involves message routing directives; the later assumption of geographic classification is how the code’s meaning is overloaded. Overloading is not a problem until the underlying business environment changes and makes the assumptions false. Unfortunately, this happens more frequently than one might expect, because the underlying business rules may change subtly over time, and the impacts often appear long after the original overloading.
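The telephone example can be sketched in a few lines of Python. Everything here is hypothetical: the lookup table, the `Caller` record, and both dispatch functions are invented for illustration and do not describe how any real emergency system works.

```python
from dataclasses import dataclass
from typing import Optional

# The overloaded meaning: area code treated as a geographic classification.
# A deliberately tiny, illustrative table.
AREA_CODE_REGION = {
    "212": "New York City",
    "213": "Los Angeles",
}


@dataclass
class Caller:
    phone_number: str                       # ten digits, no punctuation
    current_location: Optional[str] = None  # carried as its own attribute


def dispatch_overloaded(caller: Caller) -> str:
    """Infer location from the area code (the risky assumption)."""
    return AREA_CODE_REGION[caller.phone_number[:3]]


def dispatch_explicit(caller: Caller) -> str:
    """Use a separately recorded location instead of the routing code."""
    if caller.current_location is None:
        raise ValueError("location unknown; do not guess from the area code")
    return caller.current_location


# A New York mobile phone being used in Los Angeles.
visitor = Caller("2125550147", current_location="Los Angeles")
print(dispatch_overloaded(visitor))  # follows the area code
print(dispatch_explicit(visitor))    # follows the recorded location
```

The first function works only as long as phones stay where their area codes say they are. The second keeps the routing code and the geographic fact as two separate values, so each retains a single meaning.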
This is somewhat of a conundrum. The longer the overloading problem lies unnoticed, the more penetrating (and insidious) it will become. Sooner or later, it will get to the point where attempting to remediate the situation is just as intractable as ignoring it. As a routine part of data standardization and master data management, data analysis will reveal the existence of code overloading. But how do you deal with code overloading when it becomes a problem?
One practice we have used in our consulting engagements is a two-pronged approach. First, clarify the situation by reviewing and documenting all the meanings assigned to the code set. Second, evaluate the different remediation approaches and their corresponding impacts on production systems. We have typically seen a range of solutions here, and they tend to fall at two ends. At one end are solutions that address the most acute needs with the smallest impact, but they do not fully solve the problem. At the other end are better overall solutions, but these are more painful to implement and deploy. In our experience, the more effective path is to start with the low-impact solution and use it to catalyze the long-term thinking required for better, more strategic solutions.