This article originally appeared on the BeyeNETWORK.
Everyone has peeled an onion. It has a dry, sometimes dirty paper layer on the outside. Behind this inedible and usually unclean surface lies the real onion, the edible onion. In order to get to the usable part of the onion, you have to peel away the outer, inedible layer. Any cook, even a novice, can tell you about onions (perhaps with a tear or two in the eye).
And so it is with unstructured data. Unstructured data, of course, is the textual data that is found in e-mails, spreadsheets, documents, medical records, contracts and so forth. Most industry estimates say that there is 4 to 5 times as much unstructured data in the corporation as there is structured data. In other words, in a standard corporation, roughly 80 to 85% of the data is unstructured (a 4-to-1 ratio works out to 80%, and a 5-to-1 ratio to about 83%). This usually comes as a surprise to the technician who has spent his or her life working with structured systems and structured data.
Furthermore, there is a wealth of useful and relevant information that is tucked away in the unstructured environment. Indeed, marketing, sales, management and a whole host of others find very valuable information in the unstructured environment.
But also hiding in the unstructured environment is “fluff” or “blather.” Blather is information in the unstructured data that has no relation to the business. A common place for blather to be found is in e-mail. Suppose you write the following e-mail to your spouse, “Honey, let’s go out for dinner tonight.” Such e-mails are very common, yet they have very, very little relationship to the day-to-day business of the corporation. Only in the most far-fetched circumstances does blather such as this have anything at all to do with business.
So how much blather is there in unstructured data? Estimates vary widely, from 20% to 80%, depending on the corporation and the business at hand. And what is the problem with blather? The answer is that blather gets in the way. If a corporation decides to take a good look at its unstructured data, blather becomes a smoke screen that hides useful corporate information. Stated differently, if a corporation decides to become serious about unstructured data, the first step must be the removal of blather, because blather will turn an unstructured database into a data junkyard.
Therefore, in order to meaningfully search and analyze unstructured data, you must first remove the blather, much like the cook must first remove the outer layer of the onion before the onion becomes useful.
This poses a nontrivial problem for the analyst looking at unstructured data: how do you know which unstructured text is blather and which is not?
So how would one go about peeling back the layer of blather found in unstructured data? One approach is to have someone read e-mails on a daily basis and separate blather from non-blather based on his or her judgment. This approach is only effective for a very short period of time. Assuming that there is a lot of e-mail, the person reading the e-mail has a fried mind by 10:00 a.m. And, besides, having someone read all corporate e-mail violates several security restrictions and policies. Thus, manually monitoring e-mail simply is not a viable approach.
A second approach is to create a “comb” of relevant corporate terms. E-mails are passed through this comb, and all e-mails that contain business-relevant terms are placed in one pile, while messages that contain no business-relevant terms are placed in another pile. This “comb” approach can handle millions of e-mails in an automated manner.
While not perfect, the “comb” approach is the first step in separating business-relevant e-mails from non-business-relevant e-mails. At least most of the blather is peeled away.
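To make the “comb” idea concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any particular product: the term list, the function names build_comb and comb_emails, and the choice of whole-word, case-insensitive matching are all assumptions made for the example. A real comb would be built from the corporation's own vocabulary.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical comb: these terms are placeholders, not a recommended list.
BUSINESS_TERMS = {"invoice", "contract", "shipment", "purchase order", "customer"}


def build_comb(terms: Iterable[str]) -> re.Pattern:
    """Compile the terms into one case-insensitive, whole-word pattern."""
    # Longer terms first so multi-word phrases are tried before their parts.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def comb_emails(emails: Iterable[str], comb: re.Pattern) -> Tuple[List[str], List[str]]:
    """Split e-mails into (business-relevant, blather) piles."""
    relevant: List[str] = []
    blather: List[str] = []
    for text in emails:
        # One hit on any business term is enough to keep the message.
        if comb.search(text):
            relevant.append(text)
        else:
            blather.append(text)
    return relevant, blather


if __name__ == "__main__":
    sample = [
        "Honey, let's go out for dinner tonight.",
        "Please review the attached contract before Friday.",
        "The invoice for the last shipment is overdue.",
    ]
    relevant, blather = comb_emails(sample, build_comb(BUSINESS_TERMS))
    print("business-relevant:", relevant)
    print("blather:", blather)
```

The sketch also shows why the approach is “not perfect”: exact whole-word matching misses variants such as “contracts,” and a personal e-mail that happens to mention a business term lands in the relevant pile. Stemming or a richer term list would narrow those gaps, at the cost of a more elaborate comb.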
Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations. Bill can be reached at 303-681-6772.