This article originally appeared on the BeyeNETWORK.
Structured data is the data found in records in databases, usually the result of a transaction. Structured data has records, fields, indexes and so forth. For example, structured data results when: a bank honors a check and notes the account, the date, the amount and to whom the check was paid; or an airline makes a reservation, records the flight, the date, the person and the seat number.
Unstructured data is free-form text, found in e-mails, telephone conversations, medical records and reports. There are no rules for writing an e-mail, for the format of a telephone conversation or for naming the columns of a report. In a word, unstructured data is exactly that: unstructured.
How can a linkage be formed between such polar opposites as structured data and unstructured data? There are at least five ways that linkage can be formed in the second-generation data warehouse. Each has its advantages and disadvantages, and each is useful in different contexts. The five ways of linking structured and unstructured data are described in turn below.
Hard wiring occurs when an e-mail address or a telephone number is found in both the structured and the unstructured environments. When a match is found, a relationship can be inferred. However, that relationship must be considered carefully. Under normal circumstances, when a telephone conversation indicates that the call originated from a phone number belonging to me, it is assumed that I made the call. Because my wife, my daughter or my business partner may occasionally use my phone, this assumption is not always valid. Another limitation of hard wiring is that it applies to unstructured communications, such as e-mails and telephone calls, and to little else.
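As a rough illustration of hard wiring, the following Python sketch scans free-form text for e-mail addresses and telephone numbers and looks them up in a small set of structured records. The record layout, the sample data, the regular expressions and the function name `hard_wire` are all assumptions made for this example. They are not part of any DW2.0 specification, and real address and phone formats vary far more than these patterns allow.

```python
import re

# Hypothetical structured records (for example, rows from a customer table).
CUSTOMERS = [
    {"account": "A-100", "phone": "303-555-0142", "email": "pat.jones@example.com"},
    {"account": "A-200", "phone": "720-555-0199", "email": "lee.wong@example.com"},
]

# Hypothetical unstructured documents: free-form text keyed by a document id.
DOCUMENTS = {
    "email-1": "From: Pat.Jones@example.com\nPlease reverse the fee on my account.",
    "call-7": "Caller at (720) 555-0199 asked about a wire transfer.",
    "memo-3": "General reminder about branch hours.",
}

# Deliberately simple patterns; assumed formats, not a general-purpose parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def normalize_phone(raw):
    """Keep only the digits so that differently formatted numbers compare equal."""
    return re.sub(r"\D", "", raw)


def hard_wire(customers, documents):
    """Return (document id, account, matched field) for every identifier found in both."""
    by_email = {c["email"].lower(): c["account"] for c in customers}
    by_phone = {normalize_phone(c["phone"]): c["account"] for c in customers}
    links = []
    for doc_id, text in documents.items():
        for email in EMAIL_RE.findall(text):
            account = by_email.get(email.lower())
            if account:
                links.append((doc_id, account, "email"))
        for phone in PHONE_RE.findall(text):
            account = by_phone.get(normalize_phone(phone))
            if account:
                links.append((doc_id, account, "phone"))
    return links


links = hard_wire(CUSTOMERS, DOCUMENTS)
for doc_id, account, field in links:
    print(doc_id, "->", account, "via", field)
```

Each link says only that the identifier matched. It does not establish who actually wrote the e-mail or placed the call, which is the caveat raised above.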
Probabilistic matching applies when a match, once made, may still be incorrect. For example, suppose that both the structured and unstructured environments contain a person named John Smith. A match can be made because the names are the same. But what are the chances that the person referenced in one environment is actually the person referenced in the other? Because John Smith is a common name, the odds are slim. If both environments contained a person named Bill Inmon (a less common name), the probability of an actual match would be much higher, although a mismatch would still be possible. When other information such as address or date of birth is added to the comparison, the odds of an actual match increase greatly.
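The reasoning above can be made concrete with a small Python sketch. It uses a deliberately simplified model that is an assumption of this example, not a method described in the article: if n people share a name, a name-only match is right with probability 1/n, and each additional agreeing attribute is assumed to thin out the other n - 1 candidates by a fixed factor. The name counts, the narrowing factors and the function name `match_probability` are all hypothetical.

```python
# Hypothetical counts of how many people in the population share each name.
NAME_COUNTS = {"john smith": 5000, "bill inmon": 2}

# Assumed narrowing factors: how strongly agreement on a field thins out the
# other people who share the same name. Illustrative values only.
NARROWING = {"birth_date": 365, "address": 1000}


def match_probability(structured, unstructured, name_counts=NAME_COUNTS):
    """Rough estimate in [0, 1] that two records describe the same person."""
    name = structured["name"].lower()
    if name != unstructured["name"].lower():
        return 0.0
    # Expected number of *other* people who could explain the name match.
    others = name_counts.get(name, 1) - 1
    for field, factor in NARROWING.items():
        a, b = structured.get(field), unstructured.get(field)
        if a is None or b is None:
            continue  # field missing on one side: no evidence either way
        if a != b:
            return 0.0  # simplification: any disagreement is treated as a mismatch
        others /= factor
    return 1.0 / (1.0 + others)


p_smith = match_probability({"name": "John Smith"}, {"name": "John Smith"})
p_inmon = match_probability({"name": "Bill Inmon"}, {"name": "Bill Inmon"})
p_smith_more = match_probability(
    {"name": "John Smith", "birth_date": "1970-01-01", "address": "12 Elm St"},
    {"name": "John Smith", "birth_date": "1970-01-01", "address": "12 Elm St"},
)
print(p_smith, p_inmon, p_smith_more)
```

Under these assumptions the common name scores far lower than the rare one, and adding date of birth and address raises the score sharply without ever reaching certainty.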
What if there is an attempt to match words, not names, from the different environments? Suppose that the word “farm” is found in both the structured and the unstructured environments. What if “farm” in the structured environment refers to the baseball farm system (where young baseball players are given the chance to hone their skills before they are sent to the major leagues), but the word “farm” in the unstructured environment refers to a place where crops are grown? To match these two words without further knowledge would obviously lead to great confusion.
Usually a match on metadata is a good indication that there is a match at the detailed data level. However, in most cases, the unstructured environment has very little metadata; therefore, a metadata match does not occur frequently.
In both the structured and the unstructured environments, it is possible to create indexed fields. It is very easy to do so in the structured environment, where data is already organized into records and fields; but it is also possible to find data and create indexed entries on the data in the unstructured environment. When done properly, the creation of indexed entries in the unstructured environment paves the way for a natural and effective means of linking the structured and unstructured environments.
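One way to picture indexed entries over unstructured text is the following Python sketch. It pulls account-number-like tokens out of free-form notes, records where each one occurs, and then looks up structured records in that index. The account-number format, the sample notes and the function names `build_text_index` and `link_records` are assumptions made for illustration. A real implementation would choose what to index based on the documents at hand.

```python
import re
from collections import defaultdict

# Assumed account-number format (one capital letter, a hyphen, three digits).
ACCOUNT_RE = re.compile(r"\b[A-Z]-\d{3}\b")

# Hypothetical unstructured notes and structured records.
NOTES = {
    "note-1": "Customer on A-100 disputed a charge.",
    "note-2": "Follow-up for A-100 and B-205.",
}
RECORDS = [{"account": "A-100"}, {"account": "B-205"}, {"account": "C-300"}]


def build_text_index(documents):
    """Map each indexed value to the (document id, character offset) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for match in ACCOUNT_RE.finditer(text):
            index[match.group()].append((doc_id, match.start()))
    return index


def link_records(records, index, key="account"):
    """For each structured record, list the indexed occurrences of its key value."""
    return {r[key]: index.get(r[key], []) for r in records}


text_index = build_text_index(NOTES)
linked = link_records(RECORDS, text_index)
for account, hits in linked.items():
    print(account, hits)
```

Because the index stores a document id and an offset for every entry, a structured record can be traced back to the exact place in the text where its key appears.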
These are the common means of linking structured data and unstructured data. Further information on linking structured and unstructured data in a second-generation data warehouse – DW2.0 – is available free of charge at http://www.inmoncif.com/.