Introduction to data integration
The world of computing has grown from a small, unsophisticated world in the early 1960s to one of massive size and sophistication today. Nearly every person worldwide, in one way or another, is affected by or directly uses computation on a daily basis. National productivity from the 1960s to the present has been profoundly and positively affected by the widespread growth of computer use. The growth of computing can be measured in two ways: growth in structured systems and growth in unstructured systems.

Possibilities of unstructured systems

Structured systems are those for which the activity on the computer is predetermined and structured. Structured systems are designed, built, and operated by the IT department. ATM transactions, airline reservations, manufacturing inventory control systems, and point-of-sale systems are all forms of structured systems. Structured systems are tied closely to the day-to-day operational activities of the corporation. Because of this affinity, structured systems grew quickly. Cost justification and return on investment came easily because of the close tie-in with the day-to-day business of the corporation. The growth of the structured environment was fueled by the desire of the business world to be competitive and streamlined.
Unstructured systems are those that have no predetermined form or structure and are full of textual data. Typical unstructured systems include emails, reports, contracts, transcribed telephone conversations, and other communications. A person working in an unstructured environment is free to do or say whatever they want. The person doing the communicating can structure the message in whatever form is desired, using any language. In an unstructured environment, the communication can range from a proposal of marriage to a notification of a layoff to the announcement of the birth of a baby, and everything in between. There simply are no rules for the content of unstructured systems. The growth of the unstructured environment has been fostered by the need for communication, informal analysis (such as that found on a spreadsheet), and personal analysis (of finances, personal goals, personal plans). There was (and is) a different set of motivations for the growth of the two environments. Figure 4.1 shows the different environments.
From the beginning, the worlds of structured systems and unstructured systems have grown separately and yet, at the same time, parallel with each other. It is no surprise that today each environment is separate from the other in many ways:
In truth, there is little overlap or connection between the two worlds. Imagine what the world would look like if there were overlap (or intersection) between the two environments. Imagine the possibilities if the two worlds could connect in an effective and meaningful way: the new types of systems that could be built, the new opportunities for the use of computation, and the enhancements to existing systems in ways that are not possible using today's technology. When one accepts the limitations of today's technology and today's environment, there are only so many things that can be done. Imagine what would happen if those limitations suddenly disappeared.

If a bridge is to be built between the two environments, it makes sense to bring the unstructured text to the structured environment. In doing so, the decision support analyst can take advantage of the analytical processing capabilities that exist in the structured environment. In most organizations an analytical infrastructure already exists in the structured environment. This infrastructure consists of things such as a database management system (DBMS), business intelligence (BI) software, hardware, and storage. Organizations have already invested millions in their analytical environment.

The existing analytical infrastructure serves only structured systems, however. Data has to be put in a structure and a format that is particular and disciplined. Despite these constraints, it is less expensive to bring the unstructured data to the existing analytical infrastructure than it is to reconstruct that infrastructure in the unstructured environment. By bringing unstructured data to the existing analytical infrastructure, the organization can leverage the training and the investment that have already been made in it.
When the gap between unstructured data and structured data is bridged, an entirely new world of possibilities and opportunity for information systems opens up. Figure 4.2 shows that a bridge between the structured and unstructured environments has many benefits. The possibilities for new systems blossom when the gap between unstructured data and structured data is crossed. There are enormous and new opportunities that arise when the two types of data are merged.
Integrating unstructured textual data

In second-generation textual analytics, the key to crossing the bridge between the two worlds is the integration of unstructured text before it is sent to the structured environment. Raw unstructured text cannot simply be placed into the structured world and still be meaningful and useful. Stated differently, unstructured text placed directly into a structured environment creates a mess. There is too much data, and it is burdened with different meanings recorded under a single name, alternate spellings, extraneous words, and documents that have no bearing on the business. All these limitations of unstructured text become manifest when unstructured data is moved whole cloth into the structured environment. To be effective, unstructured text must be integrated before it is moved into the structured environment. By integrating unstructured text, the bridge between structured and unstructured data is created, and the stage is set for textual analytics.

Reading the unstructured textual data

The first step in the integration of unstructured text is the physical reading of the text. To be integrated, raw text must first be read, or "ingested." In some cases, the text first appears in a paper format. In this case, the text on the paper must be read (scanned) and converted to an electronic format. This process is typically done with optical character recognition (OCR) software. There are quite a few challenges to this process of lifting text from a paper foundation:
As a rule, converting from paper to an electronic format involves a manual review and correction after the electronic scan is done, if for no other purpose than to make sure the scan was successful. In many cases, manual corrections must be made when the scanning and conversion process has made an error or has made assumptions about what was read that are not true. However it is done, the text needs to be lifted from the paper media and converted into an electronic format.

Then there is the case of voice recordings. Like data found on paper, voice data needs to be lifted from the media on which it was stored and put into an electronic format that is intelligible to a program that reads and analyzes text. Voice recordings can be converted to text by means of voice character recognition (VCR). The issues of quality and reliability for VCR are similar to those for OCR.

Choosing a file type

When the text is in an electronic format, the format and structure of the text need to be taken into account. Some of the typical formats for the reading of electronic text follow:
Often the vendor supplies software to read these file types. However, the vendor often does not guarantee a 100% successful reading. For this reason, third-party vendors supply software and software interfaces that are more efficient and more reliable than those supplied by the vendor. It is true that you have to pay for third-party solutions. In return, the third-party vendor also has the responsibility of keeping up with the different releases of the base software as new releases are made.
Reading unstructured data from voice recordings

In some cases, the text does not reside on paper but on tapes. Typical of this usage are telephone conversations that are taped and then transcribed. In this case, the tapes must be converted into an electronic format, much like scanning paper, except that what is scanned is not text. Typical software in this case includes VCR. VCR technology has many liabilities associated with it. VCR is subject to being fooled by accents, by people talking too softly, and by other issues. As a rule, a transcription done with 95% accuracy is considered good. It is an interesting point that humans do not hear and understand 100% of the words that are spoken. Our brains "fill in the blanks" frequently. So it is not unreasonable that VCR does not produce a 100% accurate transcription.

However it is accomplished, the original source text must be read and entered into the component that will begin the process of textual integration. After the source text has been read, the next step is to actually integrate the text. The purpose of textual integration is to prepare the data for textual analytics. It is true that raw text can be subjected to textual analytics. However, the reading, integration, and preconditioning of the raw source text set the stage for effective textual analytics. Stated differently, textual analytics can be done on raw textual data, but not effectively. The data itself defeats much of the purpose of textual analytics. To be effective, textual analytics must operate on textual data that has been integrated and preconditioned.

The importance of integration

It is not always obvious why raw text needs to be integrated and preconditioned before it is useful and most effective for textual analytics. The following cases show why integration of text is a necessary precursor to effective textual analytics.
Simple search

A simple search is to be conducted on the name "Osama Bin Laden." Operating on unintegrated data, the search fails to find references when the name "Usama Bin Laden" appears or the name "Osama Ben Laden" appears. If textual integration had been done properly, the search for "Osama Bin Laden" would have turned up all occurrences of all spellings of his name.

Indirect search of alternate terms

Suppose an analyst wants to find all places where there is a mention of a broken bone. If the analyst searches for "broken bone," the analyst finds all the places where there are permutations of the term. However, if data is integrated first, an indirect search for "broken bone" turns up the many terms that also mean "broken bone." Operating on integrated data, an indirect search on broken bone finds "fractured radius," "lacerated tibia," "oblique fractured ulna," and so forth.

Indirect search of related terms

In addition to looking for alternate terms, related terms can also be accessed by the textual analyst. Consider the term "Sarbanes Oxley." If a direct simple search is made on the term "Sarbanes Oxley," the search will turn up the many places where that term is found. Consider what happens when raw textual data is integrated before the search is done. An indirect search can discover the many terms that are related to Sarbanes Oxley. For example, when the raw text is integrated and a search is done on related terms, an indirect search on "Sarbanes Oxley" finds items such as the following:
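The simple and indirect searches described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ALIASES` and `ALTERNATE_TERMS` tables, the function names, and the sample documents are hypothetical, and a real integration step would draw on curated glossaries and taxonomies rather than hard-coded dictionaries.

```python
import re

# Hypothetical lookup tables, hard-coded here only for illustration.
ALIASES = {
    "usama bin laden": "osama bin laden",
    "osama ben laden": "osama bin laden",
}
ALTERNATE_TERMS = {
    "broken bone": ["fractured radius", "oblique fractured ulna"],
}


def normalize(text):
    """Lowercase, collapse whitespace, and map known alias spellings
    to a single canonical spelling."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    for alias, canonical in ALIASES.items():
        text = text.replace(alias, canonical)
    return text


def indirect_search(query, documents):
    """Return the indices of documents that mention the query
    or any alternate term registered for it."""
    query = normalize(query)
    terms = [query] + ALTERNATE_TERMS.get(query, [])
    return [
        i for i, doc in enumerate(documents)
        if any(term in normalize(doc) for term in terms)
    ]


# Made-up sample documents.
documents = [
    "Report mentions Usama Bin Laden twice.",
    "X-ray shows a fractured radius.",
    "Quarterly sales were flat.",
]

print(indirect_search("Osama Bin Laden", documents))
print(indirect_search("broken bone", documents))
```

The same table-driven approach could, in principle, carry related terms as well as alternate terms; only the contents of the lookup table would differ.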
Permutations of words

Another interesting aspect of integrating text is the recognition of the roots of words. When raw unintegrated text is searched for the phrase "moving the needle," if that phrase is used anywhere, the search finds it. When raw text is integrated, permutations of the base word are recognized as well. For example, when a search is made for "moving the needle" on integrated text where the stems of words have been recognized, the results find the following:
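As a rough sketch of how stem recognition lets one query match several permutations of a phrase, consider the toy example below. The suffix list, the function names, and the sample sentences are assumptions made for illustration; this is a deliberately crude suffix stripper, not the stemming method of any particular product.

```python
import re

# Toy suffix list, an assumption for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip one common suffix, then a trailing 'e', so that
    'moving', 'moved', 'moves', and 'move' all reduce to 'mov'."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word


def stem_phrase(text):
    """Lowercase the text, keep alphabetic tokens, and stem each one."""
    return " ".join(stem(token) for token in re.findall(r"[a-z]+", text.lower()))


def matches(query, text):
    """True if the stemmed query occurs as a run of whole words in the stemmed text."""
    return f" {stem_phrase(query)} " in f" {stem_phrase(text)} "


# Hypothetical sentences, not taken from the original chapter.
print(matches("moving the needle", "The campaign moved the needle."))
print(matches("moving the needle", "They removed the needle."))
```

Padding both strings with spaces keeps the comparison on whole-word boundaries, so "removed the needle" is not mistaken for a permutation of "moving the needle."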
From these simple examples of analysis of text against raw textual data and integrated textual data, it becomes obvious that if you are going to do effective textual analytics, the data that will be operated on must first be integrated.

The issues of textual integration

The kinds of issues that must be addressed in the integration of unstructured text into the structured environment include the following:
These basic activities of integrating unstructured text are the minimum subset of processes that need to occur to provide a sound foundation in the preparation of text for textual analytics. Many other related processes can be applied to unstructured text as it is prepared for movement to the structured environment.