This article originally appeared on the BeyeNETWORK.
Corporations are slowly realizing the value of data integration. From a world of legacy systems, where data and processing were splintered, comes the awareness that there is real value in having integrated data. This realization came after years of suffering with data that was not integrated.
What can a corporation do with integrated data? The world is your oyster when your data is integrated. You know who your customer is. You know where to cross-sell. You know how to predict what your customer will need. And you know how to segment your customer base.
So how do we currently perform data integration? Today, data integration occurs in the structured world. The structured world is made up of records, tables, attributes, indexes, and other variables. To integrate data, we reconcile key values, make sure encoded values are consistent, supply default values where needed, and examine the semantics of the attributes of the data. Simply put, we do what the ETL tools allow us to do. And most ETL tools are robust in terms of their ability to integrate structured data.
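As a rough illustration of those structured-world steps, here is a minimal sketch in Python. The two source systems, their field names, and the code table are all hypothetical, invented only to show key reconciliation, consistent encoding, and default values. A real ETL tool does far more than this.

```python
# Hypothetical rows from two source systems; all field names are illustrative.
crm_rows = [{"cust_id": "00123", "gender": "M", "country": None}]
billing_rows = [{"customer_no": 123, "sex": "male", "country": "US"}]

# Assumed code table that maps each system's encoding to one shared encoding.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def normalize(row, key_field, gender_field):
    """Map one source row onto a shared layout."""
    return {
        # Reconcile key values: "00123" and 123 become the same key.
        "customer_key": int(row[key_field]),
        # Make encoded values consistent across systems.
        "gender": GENDER_CODES[str(row[gender_field]).lower()],
        # Supply a default value where the source has none.
        "country": row.get("country") or "UNKNOWN",
    }


integrated = (
    [normalize(r, "cust_id", "gender") for r in crm_rows]
    + [normalize(r, "customer_no", "sex") for r in billing_rows]
)
```

Both rows end up with the same customer key and the same gender code, which is what lets them be matched as one customer.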
We are currently entering a brave new world. This is the world of unstructured data. Unstructured data includes emails, telephone conversations, documents, spreadsheets and the like. When we examine our past data integration efforts, we find that the approaches that served the structured world simply do not work in the unstructured world. In the structured world, data integration required looking at records, tables and attributes. But when we look at the unstructured world, there are no tables, records, or attributes. In fact, there is no structure whatsoever. When you look at a telephone conversation or an email, there are simply no rules for structure. People can say anything they want, in any order, in any language. Thus, the data integration techniques that worked in the structured environment simply don’t work here.
What does work, though, is integration at the language level. Suppose you have three emails and two telephone conversations. How would you integrate that data?
To integrate this data, you must start by removing extraneous words. Extraneous words – called “stop words” in the vernacular – are words such as “a,” “an,” “and,” “the,” “is,” “was,” etc. Although these words are needed as part of our basic language, they do not add meaning to what we are saying. Instead, these words get in the way.
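Stop-word removal can be sketched in a few lines of Python. The stop list below contains only the words named above, and the simple tokenizer is an assumption for illustration; a production stop list would be much longer.

```python
import re

# Stop list taken from the examples in the text; real lists are far larger.
STOP_WORDS = {"a", "an", "and", "the", "is", "was"}


def remove_stop_words(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


kept = remove_stop_words("The car was moved and the invoice is late")
# kept == ["car", "moved", "invoice", "late"]
```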
Then we reduce words to their root stems. We recognize that the words “move,” “moving,” “moved,” and “moves” all originate from the same stem – “mov.”
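The sketch below shows the idea with a deliberately naive suffix stripper. The suffix list and the minimum stem length are assumptions chosen to handle the "move" example; this is not an established stemming algorithm such as Porter's, and it will mis-stem many English words.

```python
# Suffixes are checked in this order, so longer endings win over shorter ones.
SUFFIXES = ("ing", "ed", "es", "e", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = {naive_stem(w) for w in ["move", "moving", "moved", "moves"]}
# stems == {"mov"}
```

Once every variant maps to one stem, the words can be counted and compared as a single term.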
After doing this, we usually look for synonyms. If one document uses the term “car” and another document says “automobile,” we must substitute “car” for “automobile” wherever “automobile” is found. By doing this, we can integrate data across different documents. We must also perform similar processing for variant forms of the same name. Some examples of this are “Bill,” “Billy,” “William,” “Will” and “Billie.”
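One simple way to do this is a lookup table that maps every variant to a single preferred term. The table below is a hypothetical fragment built from the examples above; in practice it would come from a thesaurus or a name-matching list.

```python
# Hypothetical lookup table: each variant maps to one preferred term.
CANONICAL = {
    "automobile": "car",
    "billy": "bill",
    "william": "bill",
    "will": "bill",  # risky: "will" is also an ordinary English word
    "billie": "bill",
}


def canonicalize(tokens):
    """Replace each token with its preferred term, if it has one."""
    return [CANONICAL.get(t.lower(), t.lower()) for t in tokens]


result = canonicalize(["William", "drives", "an", "Automobile"])
# result == ["bill", "drives", "an", "car"]
```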
After looking for synonyms, we look for cases where one word is used to mean two different things in different places. One example of this is the word “drug.” Whereas the word means prescription medication in one context, it means a controlled substance (like marijuana or heroin) in another. We can take the places where it means one thing and substitute a more specific term.
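A crude way to pick the intended meaning is to look at the other words in the same text. In the sketch below, the cue words and sense labels are hypothetical and chosen only for this example; real disambiguation needs much richer context than a handful of keywords.

```python
import re

# Hypothetical cue words that hint at each sense of "drug".
SENSE_CUES = {
    "medication": {"prescription", "pharmacy", "dosage", "doctor"},
    "controlled_substance": {"arrest", "illegal", "heroin", "marijuana"},
}


def disambiguate_drug(text):
    """Replace "drug" with a sense label when the context clearly favors one."""
    tokens = re.findall(r"[a-z]+", text.lower())
    present = set(tokens)
    scores = {sense: len(cues & present) for sense, cues in SENSE_CUES.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up) = ranked[0], ranked[1]
    if best_score == 0 or best_score == runner_up:
        return tokens  # no clear signal, so leave the word alone
    return [best if t == "drug" else t for t in tokens]


medical = disambiguate_drug("The doctor changed the drug dosage")
# medical == ["the", "doctor", "changed", "the", "medication", "dosage"]
```

When no cue words appear, or the two senses tie, the function leaves "drug" untouched rather than guessing.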
Then, we can compare words for commonality across different documents. We can take the domain of the words that are left and see which words repeatedly appear across different documents.
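Assuming each document has already been reduced to a list of cleaned words, finding the common ones is a matter of counting how many documents each word appears in. The sample documents and the `min_docs` threshold below are illustrative assumptions.

```python
from collections import Counter


def shared_terms(documents, min_docs=2):
    """Return the words that appear in at least min_docs documents."""
    doc_freq = Counter()
    for tokens in documents:
        # Use a set so a word counts once per document, however often it repeats.
        doc_freq.update(set(tokens))
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)


# Hypothetical documents, already stripped of stop words.
docs = [
    ["car", "invoice", "late"],
    ["car", "repair", "invoice"],
    ["meeting", "car"],
]
common = shared_terms(docs)
# common == ["car", "invoice"]
```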
There are also other activities that can be done, depending on our level of sophistication. We can look at the juxtaposition of words. We can look at the frequency and proximity of words. All of this can be done to understand context.
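Frequency and proximity can both be approximated by counting. The sketch below counts how often each word occurs and how often two different words fall within a small window of each other. The window size and the sample words are assumptions for illustration.

```python
from collections import Counter


def cooccurrence(tokens, window=2):
    """Count pairs of different words that occur within `window` positions."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                # Sort the pair so (a, b) and (b, a) are counted together.
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


tokens = ["car", "repair", "invoice", "car", "repair"]
frequency = Counter(tokens)    # how often each word appears
proximity = cooccurrence(tokens)
# frequency["car"] == 2
# proximity[("car", "repair")] == 3
```

Pairs with high counts are words that tend to travel together, which is one hint about the context in which they are used.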
The integration of language is entirely different from the integration of structured systems. This brief article has merely begun to outline what is needed for language integration.
Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations. Bill can be reached at 303-681-6772.