This article originally appeared on the BeyeNETWORK.
Structured systems have been around since the beginning of "computerdom." We have had files, record layouts, databases, programs, transactions and reports for as long as there has been computing. We have had COBOL, assembler, MVS and NT almost as long.
Unstructured systems have grown up in the same world. Unstructured systems are based on e-mails, documents, textual reports, spreadsheets and the like. They have content which is unfettered by any sort of discipline. You can write an e-mail any way you like.
While structured and unstructured systems have grown up side by side, the unstructured world is the one with far more data. Most estimates place 80% of data in the unstructured world and 20% in the structured world, much to the amazement of the professionals who have spent their entire careers working in the structured world.
Recently it has been observed that great business merit and opportunity lie buried in the unstructured data of corporations. It has long been recognized that there is merit in structured data, but the news that there is business merit in unstructured data comes as something of a surprise.
The question then becomes: How do I unlock the business value in unstructured data? There are essentially two approaches. One approach consists of going to the unstructured environment, writing code and conducting analysis in that environment. The second approach is to go to the unstructured environment and find important data, structure the data and bring it back to the structured environment. Both of these approaches have merit.
The approach of going directly to the unstructured environment and processing the data there is called “content management” by many people. There are many challenges to this approach. The first challenge is in reading unstructured data and making sense of it. Reading unstructured data is actually easy; making sense of it is another matter altogether. There are no keys. There are no records. There are no formats to tell you what data you have. In short, the unstructured data is just there. You can do with it what you wish. Unfortunately, because of the complete lack of structure, it is difficult to do much with raw unstructured data.
However, there is a second challenge in trying to do analytical processing (or any other processing) in the unstructured environment. The infrastructure for dealing with unstructured data must be built in the unstructured environment. This means that large volumes of data must be handled, that query tools for unstructured data must be created, that software that understands how to decipher unstructured data must be written and so forth. In short, building the infrastructure for handling unstructured data is a challenge in its own right.
The second approach is to find the useful business data in the unstructured environment, capture that data, put it into an acceptable and useful format, and send it to the structured environment. Large amounts of data can be managed in the structured environment because query tools already exist, and indexing, data access and record management have long been the norm. A major advantage of bringing important data from the unstructured environment into the structured environment is that management has already paid for the technology that exists in the structured environment. If management has a choice, it is much less expensive and much more convenient to do all processing in the structured environment than to reinvent all of that processing (data management, querying, analysis) in the unstructured environment.
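To make the second approach concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any particular product: the sample e-mail, the field names, the `email_facts` table and the helper functions `extract_record` and `load_record` are all hypothetical. The sketch reads raw e-mail text, captures a few pieces of business data with simple pattern matching, and loads them into an ordinary relational table where existing query tools can reach them.

```python
import re
import sqlite3

# Hypothetical unstructured input: a free-form customer e-mail.
EMAIL_TEXT = """From: jane.doe@example.com
Subject: Problem with order 48213

Hi, my order 48213 arrived damaged. Please send a replacement.
"""


def extract_record(text):
    """Capture a few business fields from raw text (None when a field is absent)."""
    sender = re.search(r"^From:\s*(\S+@\S+)", text, re.MULTILINE)
    order = re.search(r"\border\s+(\d+)", text, re.IGNORECASE)
    return {
        "sender": sender.group(1) if sender else None,
        "order_id": int(order.group(1)) if order else None,
        # Crude keyword flag; real text analysis would be far more involved.
        "mentions_damage": int("damaged" in text.lower()),
    }


def load_record(conn, record):
    """Store the captured fields in a structured (relational) table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS email_facts "
        "(sender TEXT, order_id INTEGER, mentions_damage INTEGER)"
    )
    conn.execute(
        "INSERT INTO email_facts VALUES (:sender, :order_id, :mentions_damage)",
        record,
    )


conn = sqlite3.connect(":memory:")  # stand-in for the structured environment
record = extract_record(EMAIL_TEXT)
load_record(conn, record)

# Once structured, the data is reachable with an ordinary SQL query.
rows = conn.execute(
    "SELECT sender, order_id FROM email_facts WHERE mentions_damage = 1"
).fetchall()
print(rows)
```

The hard part in practice is the extraction step, which here is reduced to two regular expressions and a keyword test. Everything after that point relies on technology the structured environment already has.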
There is one other major advantage to merging the two environments. By merging the two environments, it is possible to have integrated data – data from both the structured and the unstructured environment – working cooperatively.
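As a rough illustration of integrated data, the following Python sketch joins a conventional structured table with facts that are assumed to have been captured from unstructured e-mail. The table names (`orders`, `email_facts`), the columns, the sample rows and the `damaged_orders` helper are all hypothetical and chosen only for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Structured data that has always lived in the structured environment.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    -- Facts assumed to have been extracted from unstructured e-mail.
    CREATE TABLE email_facts (order_id INTEGER, mentions_damage INTEGER);

    INSERT INTO orders VALUES (48213, 'Jane Doe', 129.50), (48214, 'Sam Lee', 42.00);
    INSERT INTO email_facts VALUES (48213, 1);
    """
)


def damaged_orders(conn):
    """Return orders whose related e-mail reported damage."""
    return conn.execute(
        """
        SELECT o.order_id, o.customer, o.amount
        FROM orders AS o
        JOIN email_facts AS e ON e.order_id = o.order_id
        WHERE e.mentions_damage = 1
        ORDER BY o.order_id
        """
    ).fetchall()


integrated = damaged_orders(conn)
print(integrated)
```

The join key (`order_id` in this sketch) is what lets the two kinds of data work together. Identifying such a key in the unstructured text is a precondition for integration.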