In previous articles, we have discussed the need for textual integration and compared it to the need for systems integration. Indeed, from the broadest perspective of ETL (extract, transform and load), we see that integration of information – text or systems – leads down a very productive road. So it is very heartening that the world is discovering the value of integration.
However, when examined closely, it will be seen that the actual techniques for integration for systems data and textual data are about as similar as an elephant is to a waterfall – about as different as any two things may be.
What exactly is meant by systems integration? What is it that ETL does for us that is so valuable?
Systems integration consists (at the least!) of:
Converting different forms of data to a common measurement. One system measures data in U.S. dollars. Another system measures money in terms of the Euro. Another system measures money in terms of the Mexican peso. It does not make sense to add Euros, dollars and pesos together – the answer just wouldn’t mean anything. To make sense of money, there must be a common unit of exchange. The data must be integrated into a common measurement before it can be meaningfully combined.
Creating a common definition of data. The term “customer” is found in different systems. In one system, customer refers to only existing customers. In another system, customer refers to people who may one day become a customer; and in yet another system, customer refers to someone who once was a customer. If data from these systems is freely mixed, the result will be alphabet soup. In order to be effective, a clarified definition of what a customer is must be established.
Creating a common format. In one system, dates are in the format of YYYYMMDD. In another system, the format for date is YYMMDD. In another system, the format for date is DDMMYYYY. No meaningful comparison of these dates can be made unless a common format is chosen and the dates are all reset to the common format.
Operating from a common calendar. In one system, the standard actual calendar dates are used. In another system, there is a corporate calendar that states that months end on the 28th of the month. In another calendar, there are a series of fifteen-day periods. Unless these corporate calendars are resolved, adding data based on different corporate calendars will produce unreconcilable results.
Using common key structures. There are three systems, each with a customer ID. Unfortunately, each system has a completely different interpretation of what the key structure should look like. Unless the key structures are reconciled, merging information or comparing information from the different systems becomes impossible.
Using common encoded values. In one system, “GM” refers to General Motors. In another system, “GM” refers to General Mills. Unless we want revenues and other information for auto parts to be mixed with revenues for cereals, we need to rectify the discrepancy before we can call the system integrated.
There are then a host of things that must be done in order to integrate data at the systems level. This short list has discussed only a few of the considerations.
What must be done to integrate data at the textual level? Textual integration consists of (at the least!) these activities:
Stemming words. Stemming words recognizes that many words come from a common stem. The words “moving,” “moved,” “moves” and “mover” all come from the common stem “move.” Textual integration requires the stemming of text as the first step toward achieving integration.
Synonym replacement. Synonym replacement requires that synonyms be replaced. As an example, “oblique fractured ulna” could be replaced with the term “broken arm.”
Synonym concatenation. Occasionally, it is advisable to not replace one word with another. It is instead advisable to recognize that concatenation of a synonym is desirable. As an example, the term “oblique fractured ulna” could be replaced by “oblique fractured ulna [broken arm].”
Homograph resolution. The opposite of synonym replacement and concatenation is homograph resolution. Suppose the term “ha” is found in a medical environment. The analyst could go back into the system and for all cardiology text replace “ha” with the term “heart attack” and then replace the term ”ha” with “Hepatitis A” for all endocrinologist data. In doing so, the text now has a much clearer and precise meaning.
Negativity exclusion. As part of textual integration, wherever a negative is found, text can be deleted. For example, when the text “not a heart attack” is found, the indexing of “heart attack” does not include this reference.
By looking at the techniques required to achieve textual integration, it is seen that these techniques are completely different than the techniques needed to achieve systems integration. The irony here is that integration in both cases is very valuable, but how integration is achieved is completely different.