This article originally appeared on the BeyeNETWORK.
In this article, we take a hard look at the differences that drive the unstructured data realm and introduce the new ideas required to integrate unstructured data into data warehousing.
We have long realized that only 20% of our data lives in a structured world. Now it’s time to address the 80% of unstructured and semi-structured data – gaining both further insights into our systems and additional understanding of our business and its profitability. The main topics include:
- What is unstructured data?
- What types of functionality link unstructured data to structured data?
- How does it pertain to DW 2.0?
- What makes the life cycle different from the traditional data warehouse life cycle?
- What types of issues are involved when combining unstructured data with data warehousing?
- What about auditability and compliance in DW 2.0?
Unstructured data is everywhere; it is everything in the organization that does not appear to have a highly structured definition. Experts in the industry have multiple definitions for unstructured data, but the good news is that most of these definitions coincide with one another.
Common unstructured data sources include:
- Documents: Word documents, PowerPoint presentations, newsletters, source code, hard-copy documents
- Images and graphics
Common semi-structured data sources include:
- TCP/IP packets
- Images and graphics
- Zipped files
- Binary executables
- Documents (all listed previously)
Is Unstructured Data an Oxymoron?
I believe the term may literally be an oxymoron; however, using it to classify information that we do not currently house within an RDBMS, CSV file or MS-Access system certainly lends itself to discussion. Here’s an interesting quote about unstructured data:
To be perfectly clear: "unstructured data" is an oxymoron. Unstructured bits, characters, and words are not data, they are gibberish. Or noise. Or, to use a data processors' term, garbage. While the market for garbage processing seems rather confined to the Silicon Valley Toxics Coalition, somehow vendors of "unstructured data processing" have been able to raise millions in venture capital. So to what, exactly, are these vendors referring?
The point in this series isn’t to debate the definition of unstructured data, but rather to clarify the term and its utilization in classifying that which we do not already embed within our traditional data warehousing solutions. To the blogger’s point – the data itself is, in fact, structured – it all depends on what analysis tool is used to scan the data set.
Context is what we’re after. Maybe we can’t achieve true context, but we can assimilate and establish cross-references, indexes and relevance ratings to all the content outside our traditional data warehouse (that usually resides within a relational database system). However, for this article, I will not dive further into context or meaning, except to offer an oversimplified hypothesis of what I think context might be: context is the nature of formulating understanding of content and placing content within a known set of bounds from which to interpret specific results and infer particular meaning.
Why are These Lists so Short?
Well, they identify a whole range of items that exist within the enterprise. Rather than list specific types of items (as I began doing in the semi-structured list), I decided it would be best to classify unstructured data. However, you may have already realized that everything that’s on the unstructured list is also on the semi-structured list.
Why is Unstructured Data also Semi-Structured?
Every piece of data exists in one form or another under a given structure. Somewhere in the Word document, even in the text file (under the covers), there’s a structure to hold the bits and bytes together as a unit of work. Images have specific image converters, which will take images from one semi-structured format to another. Word documents and PowerPoint files have specific structures to enclose their bits and bytes as well.
Where is Unstructured Data?
The unstructured data itself is the actual image, or the actual sentences, words and alphanumeric characters held within the documents. This is where the knowledge is buried. This is where the 80% of corporate information is stored. Functionality or interpretation of the unstructured data is what makes it important – how it’s linked in with the existing structured information.
Of course, this raises the question: what is structured information? In general, structured information is defined at the cell level by column and data class. For instance, a first name column in a table called Customer provides both structure and basic meaning to the information it contains.
The chasm between structured and unstructured knowledge is what must be crossed in order to make sense of an organization’s full data set. Indexing, ranking, scoring, tagging and maintaining edit trails are all parts of linking “unstructured” information into a structured world. These actions usually require a different approach than that of the standard relational database world.
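The indexing and ranking activities mentioned above can be sketched in a few lines. The following is a minimal, illustrative Python example (the documents and the term-frequency scoring are assumptions for demonstration, not a production design) of building an inverted index that links "unstructured" text back to structured, queryable entries:

```python
from collections import Counter, defaultdict

# Hypothetical documents; in practice these would be crawled files or e-mails.
documents = {
    "memo-001.txt": "quarterly revenue exceeded forecast revenue",
    "memo-002.txt": "customer complaint about delayed shipment",
}

# Build an inverted index: term -> {doc_id: term frequency}.
index = defaultdict(dict)
for doc_id, text in documents.items():
    for term, freq in Counter(text.lower().split()).items():
        index[term][doc_id] = freq

def rank(term):
    """Return documents containing the term, highest frequency first."""
    hits = index.get(term.lower(), {})
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)

print(rank("revenue"))  # → [('memo-001.txt', 2)]
```

A real system would add stemming, stop-word removal and weighted relevance scores, but the structured artifact – an index table relating terms to document pointers – is the same.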
Unstructured or Semi-Structured Data in the DW 2.0 Architecture
In Figure 1, based on Bill Inmon’s DW 2.0 specifications, we’ve connected all possible data sets to the data warehouse. This includes active and batch-oriented data, documents, transactions, traditional data sets, indexes, rankings and so on. We’ve connected this enterprise data warehouse to the enterprise service bus – which, of course, connects it to all the source systems and all the other technical components it must interact with.
Why are there Separate Stores in DW 2.0?
The first reason is volume, the second reason is response time and the third reason is establishing context. In order to establish context or link the information at the right place, a set of “active tools” must be enabled. The tools along the top are all active – and are running continuously on streams of data. They are also utilizing live indexes that reach out to machines to watch for edits to documents, changes to images and newly arriving e-mails. If the current data warehouse is 456TB and it’s only 20% of a company’s total data assets, imagine how large the information store would need to be to house the other 80%.
The active tools run 24x7x365 to monitor, authorize, and manage external and internal content. External content can exist on PCs, desktops, mobile devices and the Internet. Not all content will need to be stored locally (within the confines of the physical data warehouse); however, some of the content must be stored locally for mining and correlation analysis.
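One simple way to sketch such an active tool is a polling monitor that compares file modification times against a recorded snapshot. Real implementations would use file-system event APIs rather than polling, and the function name here is illustrative:

```python
import os

def changed_files(paths, last_seen):
    """Return files whose modification time differs from the snapshot in
    last_seen, updating the snapshot in place. Scheduling this call
    continuously approximates the 'active tools' watching for edits."""
    changed = []
    for path in paths:
        mtime = os.path.getmtime(path)
        if last_seen.get(path) != mtime:
            last_seen[path] = mtime
            changed.append(path)  # hand the changed document to re-indexing
    return changed
```

Each changed document would then be re-indexed and its relevance ratings refreshed, rather than copying the whole document into the warehouse.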
What does the life cycle look like compared to the traditional corporate information factory (CIF)? We’ve done our best to bring you an updated life cycle image based on an older version of the CIF, and I believe Figure 2 represents the necessary components appropriately.
As you can see, there are different types of sources: external feeds, document management systems, the Internet, desktops, servers (e-mail, etc.), image farms and so on. In the unstructured world, acquisition – the “ETL” – consists of locating data, indexing it, ranking it and possibly placing it in localized (archival) storage. The data warehouse becomes a “term and index warehouse,” holding pointers to the documents. The operational data store becomes an operational index store – with active data, active rankings and active relevance ratings, updated every time a monitored document is edited.
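A “term and index warehouse” of this kind can be sketched as a mapping from terms to document pointers rather than to document bodies. The structure and field names below are purely illustrative assumptions, not the DW 2.0 specification:

```python
# The warehouse stores terms plus pointers (URIs) to documents,
# not the documents themselves; ranks and timestamps are illustrative.
term_warehouse = {
    "revenue": [
        {"uri": "file://shared/memo-001.txt", "rank": 0.92, "last_indexed": "2007-05-01"},
        {"uri": "file://mail/msg-4417.eml", "rank": 0.41, "last_indexed": "2007-05-02"},
    ],
}

def locate(term):
    """Return document pointers for a term, best relevance first."""
    entries = term_warehouse.get(term.lower(), [])
    return [e["uri"] for e in sorted(entries, key=lambda e: e["rank"], reverse=True)]
```

The point of the design is that the warehouse stays a manageable size: terms, ranks and pointers live centrally, while the bulky content stays at the source until it is needed.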
What Does This Mean to Data Delivery?
Finally, delivery or the business intelligence (BI) of this world is the “Document Correlation and Delivery” mechanisms. These are not usually your traditional BI reporting tool sets. These tools are more akin to Inmon Data Systems, Google Search Appliances and the like. However, one thing we will begin to inherit from the traditional enterprise data warehouse (EDW) space is the notion of analytics and predictive analytics. Governance, data stewardship, metadata and management will all take a front seat. Why? Because privacy is at stake, legal issues must be resolved and the company executing on an unstructured data EDW – or UDW – must begin to show due diligence in collecting information from company assets, oftentimes without the knowledge or approval of its employees.
The analytics will change – instead of pivoting data, we will play what-if games on context and spatial relationships. Data visualization will move to the forefront of importance. Land-mapping of documents and their relevance to the business question being asked will become standard. The analytics will provide insight into term relationships, and possibly snippets of multiple documents that show some context to the terms being indexed or searched.
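Showing “snippets of multiple documents” around an indexed term might be sketched as a simple context-window extraction. This is an illustrative helper under assumed parameters, not any specific product’s API:

```python
def snippet(text, term, window=30):
    """Return a short snippet of text surrounding the first occurrence of
    a term, with ellipses marking truncated context on either side."""
    pos = text.lower().find(term.lower())
    if pos < 0:
        return ""
    start = max(0, pos - window)
    end = min(len(text), pos + len(term) + window)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

An analytic front end would collect such snippets across many documents and rank them by relevance, giving the analyst context rather than raw hit counts.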
The “term exploration warehouse” will become the data mining center of the future – allowing the contextual analyst to play with the relationships, meanings and interpretation of the documents and e-mails. There will be more on this as we develop this series.
As stated, governance and compliance, along with metadata and business process flows, are important here. It is no longer good enough to match existing “machine / operational” processing to business requirements (that task doesn’t go away, but its nature changes). DW 2.0 includes governance, compliance and metadata by nature. The importance of these tasks is driven by access to unstructured source data. There are questions of ownership, ethics, quality, understanding and, most of all, interpretation. What will this data be used for? Why is it being collected? How will the business separate personal information or “commentary filled with emotion” from fact?
These are the true questions within a UDW, and as we move forward in this series of articles, we hope to uncover some basic starting points for organizations struggling to answer these questions. In the next article, we dive a little deeper into the technical nature of unstructured data and take a look at form versus function along with utilization and interpretation. For now, I welcome all feedback and comments you may have.
Learn how to take full advantage of previously untapped sources of data by incorporating data from traditional sources as well as from text documents, e-mails, images and articles. Unlock the wealth of valuable information in your organization and deliver an increased data analysis capability. Enhance the operational and analytical capability of your organization by attending the DW 2.0 Conference that features the architecture developed by the "father" of data warehousing – Bill Inmon. Don’t let an outdated architecture impact your organization’s performance. Learn from experts, including Bill Inmon, Joyce Norris-Montanari, Sid Adelman and many more, why the architecture for the next generation of data warehousing is a more effective business intelligence tool.
About the Author
Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor at The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMi Level 5, and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog on http://www.b-eye-network.com/blogs/linstedt/.