Book Chapter: Managing unstructured data in the organization
Most organizations have two kinds of data: structured data and unstructured textual data. Because structured data preceded unstructured data in the workplace, unstructured data is often best understood in contrast to structured data.

Structured data is data that is represented by numbers, tables, rows, columns, attributes, and so forth. Most IT professionals have spent the better part of their professional lives with structured data. As its name implies, structured data is usually disciplined, well behaved, predictable, and repeatable.

Structured data is usually generated as a by-product of a transaction. A check is cashed, an ATM activity is done, an insurance claim is made, a production run is completed, a car is sold: these are typical transactions that generate a lot of structured data about the activity that has been done.

Structured data is made up of data types that are repeated continually. The same types of data are found in almost every transaction. The only things that differ from one transaction to the next are the values that the data types take.

In addition to repeatability and predictability, the essence of the structured environment is numeric data. Although there is text in the structured environment, most text serves the purpose of identifying or describing some numeric data. The numeric data in the structured environment makes up the heart of the data that is found there and is heavily used for analytical purposes.

Unstructured textual data

The other major category of data found in the corporation is unstructured data. It comes in two broad forms: textual unstructured data and nontextual unstructured data, which includes images, colors, sounds, and shapes. This book is about textual unstructured data, which presents enough challenges on its own to fill a book (or even more than a book!). Unstructured textual data is textual data found in emails, reports, documents, medical records, and spreadsheets.
There is no format, structure, or repeatability to unstructured textual data. There is no one sitting on your shoulder telling you what to do when you write an email. You can write anything you want, however you want, and use any language you want. In addition, there are other forms of text that occur well outside the email environs, such as contracts, warranties, spreadsheets, telephone books, advertisements, marketing materials, annual reports, and many more forms of textual information that are the fabric of the organization. In short, unstructured textual data occurs almost everywhere and represents both a challenge and an opportunity to the organization that wants to use it for decision-making purposes.
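The contrast between the two kinds of data can be illustrated with a minimal sketch. The `AtmTransaction` class, its field names, and the sample values are hypothetical and chosen only for illustration; they do not come from the chapter.

```python
from dataclasses import dataclass, fields


# Hypothetical structured record: every transaction has the same fields,
# and only the values differ from one transaction to the next.
@dataclass
class AtmTransaction:
    account_id: str
    amount: float
    timestamp: str


transactions = [
    AtmTransaction("A-100", 40.0, "2024-01-05T09:30"),
    AtmTransaction("A-200", 120.0, "2024-01-05T10:15"),
]

# Hypothetical unstructured text: no fields, no fixed format, no repeatability.
email_body = "Hi, my order arrived late and the box was damaged. Please call me back."

# The repeated structure makes analysis direct: the field names are known
# in advance and the numeric values can be summed immediately.
field_names = [f.name for f in fields(AtmTransaction)]
total_withdrawn = sum(t.amount for t in transactions)

# The email offers nothing comparable. Before any interpretation is applied,
# it is just a sequence of words.
words = email_body.split()

print(field_names)
print(total_withdrawn)
print(len(words))
```

The structured records can be aggregated with one line of code because their shape is known ahead of time. Any analysis of the email has to begin by deciding what, if anything, in the text is worth extracting.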
It is true that many forms of unstructured data are not text-based. There are X-rays showing bones and breaks, real-estate listings with pictures, engineering change control documents mapping the structural changes made to complex edifices, MRIs that show detailed aspects of the human body, and scientific photos that help mankind unlock the secrets of the universe. But the most basic form of unstructured data is in the form of text. The focus in this book is on text, which presents its own set of challenges.

The purpose of this chapter is to provide an overview of unstructured textual data and the environment in which it sits.

Unstructured textual data is so pervasive, is so ubiquitous, and has so many variations that it is hard to classify. In one place, unstructured textual data has one set of characteristics. In another place, unstructured textual data has a completely different set of characteristics. Because of this topsy-turvy, unpredictable nature, it is extremely difficult to generalize how to approach unstructured textual data. Following are some examples of the nonuniform characteristics of unstructured textual data:
And these are but a few of the types of unstructured textual data that exist in the organization. Each type of unstructured textual data has its own peculiarities and its own characteristics.

Unstructured textual data and organizational functions

To start to understand unstructured textual data, consider that there are different kinds of unstructured textual data associated with different functions within the organization. Table 1-1 shows some of the corporate functions and the unstructured textual data that is typical of those departments.

Table 1-1 Corporate Functions -- Unstructured
The corporate function of accounting typically has spreadsheets, Word documents, audit reports, and audit trails associated with its activities. Call centers typically have recorded or transcribed conversations, replies, follow-up activities, and other notes associated with their activities. The engineering department typically has unstructured textual data associated with the bill of materials, engineering changes that have been made, production archives, and design specifications. Each of these different forms of unstructured textual data has its own set of characteristics.

Unstructured textual data does not vary only from one department to the next. It also appears in different forms and in different measures in different industries. Some industries have a lot of unstructured textual data while others have little. Table 1-2 shows unstructured textual data by industry and emphasizes the differences.

Table 1-2 Industries and Unstructured Data
Table 1-2 shows that different industries have different mixes of transaction processing data and unstructured textual data. For example, banks are rich with transaction processing: checks, ATM activities, and other banking activities. Although banks certainly do have unstructured textual data, they do not have (relatively speaking) nearly as much unstructured textual data as they have structured data.

On the other hand, medical environments are rich in unstructured textual data. Unstructured data is so ingrained in healthcare that it is part of the fabric of medicine. Doctors write and take notes in a textual fashion, hospitals take notes textually, and so on. Transactions certainly occur in the medical environment as well; patients get billed on a regular basis, for example. But, relatively speaking, the medical environment is dominated by textual data.

Unstructured data and its characteristics

One of the perplexing aspects of unstructured textual data is that the characteristics associated with the different forms of unstructured textual data are mixed across the many different forms of the data. There is little uniformity among the different forms of unstructured textual data. Table 1-3 on unstructured textual data and characteristics points out the extreme lack of uniformity of the characteristics.

(To view Table 1-3, download a free .pdf of this chapter and scroll to pages 6-7.)

Table 1-3 is complex and deserves an explanation. The columns are as follows:
When you look down any of the columns, you see that there is little or no rhyme or reason to the characteristics. One form of unstructured textual data has one set of characteristics, and the next form of unstructured textual data has a completely different set of characteristics. It is this complete lack of characteristic pattern coupled with the complexity of language that causes the difficulty with the automated usage of data.

The challenges of unstructured textual data and analytical processing

Traditionally, analytic processing is used for business analysis of structured data. Structured data is particularly amenable to analytic processing because structured data is
For these reasons, analytical processing is a natural partner with structured data. However, unstructured textual data has none of these characteristics. To do analytical processing against unstructured textual data, it is necessary to overcome or address several obstacles. Some of these challenges follow:
This is just the short list of challenges that await the organization that attempts to come to grips with the unstructured textual data environment.

The opportunities of unstructured textual data

From the preceding discussion, it is obvious that there is a cost (in time, money, and manpower) to get a handle on the unstructured textual data environment. Why would an organization want to come to grips with the unstructured environment?

A world of promise and opportunity in information is buried in the unstructured textual data environment. Organizations have a tremendous opportunity to make better decisions (more timely, more accurate, more informed) when they incorporate unstructured information into the decision-making process. That is the reason why organizations need to come to grips with the unstructured textual data environment.

Stated differently, organizations that look only at their structured data, usually transaction-based data, miss an entire class of information that waits to be used for the decision-making process. Organizations that base all their decisions on structured data use only a portion of the corporate information on which to base decisions. It is like a manager making decisions solely on revenues for this month. Although this month's revenues are an interesting figure and are certainly important, a lot of other types of information are factored in as well. There are monthly expenses, revenue figures for next year, the projections for next year, the size of the customer base, and new product announcements that need to be considered.

So exactly what kinds of useful information can be gained from looking at unstructured textual data? Some examples of useful information that might be buried in unstructured textual data include the following:
But perhaps the biggest promise of unstructured textual data lies in its ability to be combined with structured data. As a simple example of combining structured data and unstructured textual data, consider emails. How important are emails for the creation of the complete picture for a customer? The answer is that communications with the customer are important. Consider the following simple example. An organization has a lot of demographic information about its customers. Suppose there is a Mrs. Jones who is a customer. The company knows that Mrs. Jones
With this information, the company thinks that it is prepared to deal with Mrs. Jones, establishing a personal and long-term relationship with her.

So how important is demographic information in understanding Mrs. Jones? Very important. How important is Mrs. Jones' demographic information in the face of having no information about communications? Not important at all.

It seems that Mrs. Jones ordered some goods a month ago. The goods were late, sent to the wrong address, and broken when they arrived. And Mrs. Jones sent a scathing email to the corporation last week. How important is it to know about communications when trying to establish a relationship with Mrs. Jones? Vitally important.

And what are emails? They are nothing but a form of unstructured textual data. There are many cases where establishing a relationship between unstructured textual data and structured data leads to business opportunities that are today unimagined.
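The Mrs. Jones example can be sketched in a few lines of code, as a rough illustration rather than a recommended design. Everything here is a hypothetical assumption: the customer identifier `C-001`, the `segment` field, the email text, the `COMPLAINT_TERMS` list, and the helper names. The keyword matching is deliberately naive; real text analysis has to contend with the language challenges described in this chapter.

```python
# Hypothetical structured data: one demographic record per customer.
customers = {
    "C-001": {"name": "Mrs. Jones", "segment": "long-term"},
}

# Hypothetical unstructured data: free-form emails tagged with the sender's id.
emails = [
    {
        "customer_id": "C-001",
        "body": "My order was late, sent to the wrong address, and broken when it arrived.",
    },
]

# Assumed list of phrases that suggest a complaint. Plain substring matching
# is crude (for example, "late" would also match "plate").
COMPLAINT_TERMS = ("late", "wrong address", "broken")


def complaint_terms(body):
    """Return the complaint phrases that occur in an email body."""
    text = body.lower()
    return [term for term in COMPLAINT_TERMS if term in text]


def customer_picture(customer_id):
    """Combine the structured profile with signals found in the customer's emails."""
    picture = dict(customers[customer_id])
    found = []
    for email in emails:
        if email["customer_id"] == customer_id:
            found.extend(complaint_terms(email["body"]))
    picture["complaints"] = found
    return picture


print(customer_picture("C-001"))
```

The demographic record alone says nothing about the failed delivery. Only once the email text is linked to the customer does the profile show that the relationship is in trouble.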