This article originally appeared on the BeyeNETWORK.
In this article, we take a hard look at the differences that drive the unstructured data realm and introduce the new ideas required to integrate unstructured data into data warehousing.
We address the following questions:
- What is unstructured data?
- What types of functionality link unstructured data to structured data?
- How does it pertain to DW 2.0?
- What makes the life cycle different than the traditional data warehouse life cycle?
- What types of issues are involved when combining unstructured data with data warehousing?
- What about auditability and compliance in DW 2.0?
Unstructured data is everywhere; it is everything in the organization that does not appear to have a highly structured definition. Experts in the industry have multiple definitions for unstructured data, but the good news is that most of these definitions coincide with one another.
Common unstructured data sources include:
- Documents: Word documents, PowerPoint presentations, newsletters, source code, hard-copy documents
- Images and graphics
Common sources of semi-structured data include:
- E-mails
- TCP/IP packets
- Images and graphics
- Zipped files
- Binary executables
- Documents (all listed previously)
Is Unstructured Data an Oxymoron?
I believe the physical term may be an oxymoron; however, the use of the term to classify
information that we do not house currently within an RDBMS, CSV or MS-Access system certainly lends
itself to discussion. Here’s a quote about unstructured data that is interesting:
The point in this series isn’t to debate the definition of unstructured data, but rather to clarify the term and its utilization in classifying that which we do not already embed within our traditional data warehousing solutions. To the blogger’s point – the data itself is, in fact, structured – it all depends on what analysis tool is used to scan the data set.
Context is what we’re after. Maybe we can’t achieve true context, but we can assimilate content and establish cross-references, indexes and relevance ratings for everything outside our traditional data warehouse (which usually resides within a relational database system). However, for this article, I will not dive further into context or meaning, except to offer an oversimplified hypothesis of what I think context might be: context is the act of formulating an understanding of content and placing that content within a known set of bounds from which to interpret specific results and infer particular meaning.
Why are These Lists so Short?
Well, they identify a whole range of items that exist within the enterprise. Rather than
list specific types of items (as I began doing in the semi-structured list), I decided it would be
best to classify unstructured data. However, you may have already realized that everything that’s
on the unstructured list is also on the semi-structured list.
Why is Unstructured Data also Semi-Structured?
Every piece of data exists in one form or another under a given structure. Somewhere in
the Word document, even in the text file (under the covers), there’s a structure to hold the bits
and bytes together as a unit of work. Images have specific image converters, which will take images
from one semi-structured format to another. Word documents and PowerPoint files have specific
structures to enclose their bits and bytes as well.
Where is Unstructured Data?
The unstructured data itself is the actual image, or the actual sentences, words and
alphanumeric characters held within the documents. This is where the knowledge is buried. This is
where an estimated 80% of corporate information is stored. What makes unstructured data
important is its functionality or interpretation: how it is linked with the existing structured
information.
Of course, this raises the question: what is structured information? In general, structured information is that which is defined at the cell level by column and data class. For instance, a first name column in a database table called Customer provides both structure and basic meaning to the information it contains.
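To make the cell-level definition concrete, here is a minimal sketch using Python's built-in sqlite3 module. Only the Customer table and the first name column come from the text; the `customer_id` column, the sample value and the in-memory database are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: only the Customer table and a first-name column come
# from the text; every other name here is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customer ("
    " customer_id INTEGER PRIMARY KEY,"
    " first_name TEXT NOT NULL)"
)
conn.execute("INSERT INTO Customer (first_name) VALUES (?)", ("Ada",))

# The catalog describes every cell: its column name and its data class.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(Customer)")]
first_names = [row[0] for row in conn.execute("SELECT first_name FROM Customer")]

print(columns)
print(first_names)
```

The point of the sketch is that the database can answer "what is this value?" from its own catalog, before any analysis tool looks at the content.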
The chasm between structured and unstructured knowledge is what must be crossed in order to make sense of an organization’s full data set. Indexing, ranking, scoring, tagging and maintaining edit trails are all parts of linking “unstructured” information into a structured world. These actions usually require a different approach than that of the standard relational database world.
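One way to picture that link is a small structured record that points at a document and carries its tags, score and edit trail. The sketch below is an assumption about how such a record might look, not a DW 2.0 specification; `IndexEntry`, its fields, the sample location and the sample score are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IndexEntry:
    """Structured record that points at one unstructured document."""
    location: str                                  # pointer to the document, not its content
    tags: set = field(default_factory=set)
    score: float = 0.0                             # relevance score, however it is computed
    edit_trail: list = field(default_factory=list)

    def tag(self, label: str) -> None:
        # Tags are normalized to lowercase so lookups are case-insensitive.
        self.tags.add(label.lower())

    def record_edit(self, editor: str, note: str) -> None:
        # Append-only trail: earlier entries are never rewritten.
        stamp = datetime.now(timezone.utc).isoformat()
        self.edit_trail.append((stamp, editor, note))


# Hypothetical document location and values, for illustration only.
entry = IndexEntry(location="file://share/contracts/acme.docx")
entry.tag("Contract")
entry.tag("Customer:42")
entry.score = 0.87
entry.record_edit("jdoe", "revised payment terms")
```

A tag such as `customer:42` is what ties the document back to a row in the structured world; the trail keeps a record of who touched the entry and why.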
Figure 1
Unstructured or Semi-Structured Data in the DW 2.0 Architecture
In Figure 1, based on Bill Inmon’s DW 2.0 specifications, we’ve connected all possible
data sets to the data warehouse. This includes active and batch-oriented data, documents,
transactions, traditional data sets, indexes, rankings and so on. We’ve connected this enterprise
data warehouse to the enterprise service bus, which in turn connects it to all the source systems
and all the other technical components that it must interact with.
Why are there Separate Stores in DW 2.0?
The first reason is volume, the second reason is response time and the third reason is establishing
context. In order to establish context or link the information at the right place, a set of “active
tools” must be enabled. The tools along the top are all active – and are running continuously on
streams of data. They are also utilizing live indexes that reach out to machines to watch for edits
to documents, changes to images and newly arriving e-mails. If the current data warehouse is 456TB
and it’s only 20% of a company’s total data assets, imagine how large the information store would
need to be to house the other 80%.
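The arithmetic behind that thought experiment is short. The sketch below uses the two figures from the text and assumes simple proportional scaling, which is an assumption: real unstructured content may compress or index very differently.

```python
# Figures from the text: a 456 TB warehouse holding 20% of total data assets.
warehouse_tb = 456
warehouse_share_pct = 20

total_tb = warehouse_tb * 100 / warehouse_share_pct   # all data assets
remaining_tb = total_tb - warehouse_tb                # the other 80%

print(f"total: {total_tb:.0f} TB, outside the warehouse: {remaining_tb:.0f} TB")
```

Under that assumption the full data set is 2,280 TB, of which 1,824 TB sits outside the current warehouse.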
The active tools run around the clock (24x7) to monitor, authorize and manage external and internal content. External content can exist on PCs, desktops, mobile devices and the Internet. Not all content will need to be stored locally (within the confines of the physical data warehouse); however, some of the content must be stored locally for mining and correlation analysis.
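As a rough illustration of what "watching for edits" could mean in practice, the sketch below compares content digests between scans. It is a polling approach chosen for simplicity, not a description of any DW 2.0 tool; `ChangeMonitor` and the sample documents are hypothetical, and fetching the bytes from file shares, mail servers or the web is left out.

```python
import hashlib


class ChangeMonitor:
    """Detects new or edited documents by comparing content digests."""

    def __init__(self):
        self._digests = {}   # location -> last seen SHA-256 digest

    def scan(self, documents):
        """Return locations whose content is new or changed since the last scan.

        `documents` maps a location to its current bytes; how those bytes are
        fetched is outside this sketch.
        """
        changed = []
        for location, content in documents.items():
            digest = hashlib.sha256(content).hexdigest()
            if self._digests.get(location) != digest:
                self._digests[location] = digest
                changed.append(location)
        return changed


monitor = ChangeMonitor()
first = monitor.scan({"memo.txt": b"draft one", "logo.png": b"\x89PNG"})
second = monitor.scan({"memo.txt": b"draft two", "logo.png": b"\x89PNG"})

print(first)    # both documents are new on the first scan
print(second)   # only the edited memo is reported on the second scan
```

Only the digests are kept between scans, which fits the idea that not all content needs to be stored locally.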
Life Cycle
What does the life cycle look like compared to the traditional corporate information factory (CIF)?
We’ve done our best to bring you an updated life cycle image based on an older version of the CIF,
and I believe Figure 2 represents the necessary components appropriately.
As you can see, there are different types of sources: external, document management, Internet, desktops, servers (e-mail, etc.), image farms and so on. The nature of acquisition or the “ETL” of the unstructured world is: data location, indexing, ranking and possibly use of localized storage (archival storage). The data warehouse becomes a “term and index warehouse,” with pointers to the documents. The operational data store becomes an operational index store – with active data, active rankings and active relevance ratings. This data is updated every time a monitored document is edited.
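A “term and index warehouse” with pointers can be sketched as an inverted index that stores document locations and term counts but never the documents themselves. This is a minimal illustration under that assumption; `TermIndexWarehouse`, the sample locations and the use of raw term counts as the ranking are all hypothetical choices.

```python
import re
from collections import Counter, defaultdict


class TermIndexWarehouse:
    """Maps each term to pointers at documents; the content is not kept."""

    def __init__(self):
        self._postings = defaultdict(dict)   # term -> {location: term count}

    def index(self, location, text):
        # Re-indexing after an edit: drop the old postings for this location.
        for postings in self._postings.values():
            postings.pop(location, None)
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        for term, count in counts.items():
            self._postings[term][location] = count

    def lookup(self, term):
        """Return document pointers for a term, highest term count first."""
        postings = self._postings.get(term.lower(), {})
        return sorted(postings, key=lambda loc: (-postings[loc], loc))


warehouse = TermIndexWarehouse()
warehouse.index("mail://inbox/17", "Invoice overdue: second invoice notice")
warehouse.index("file://docs/terms.docx", "Invoice terms and conditions")
hits_before = warehouse.lookup("invoice")

# The monitored e-mail is edited, so its index entries are refreshed.
warehouse.index("mail://inbox/17", "Payment received")
hits_after = warehouse.lookup("invoice")
```

Calling `index` again for the same location is the sketch's stand-in for the operational index store being updated whenever a monitored document changes.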
What Does This Mean to Data Delivery?
Finally, delivery, or the business intelligence (BI) of this world, is handled by “Document
Correlation and Delivery” mechanisms. These are not usually your traditional BI reporting tool
sets; they are more akin to Inmon Data Systems, Google Search Appliances and the like. However,
one thing we will begin to inherit from the traditional enterprise data warehouse (EDW) space is
the notion of analytics and predictive analytics. Governance, data stewardship, metadata and
management will all take a front seat. Why? Because privacy is at stake, legal issues must be
resolved, and the company executing on an unstructured data EDW (or UDW) must begin to show due
diligence in collecting information from company assets, oftentimes without the knowledge or
approval of its employees.
The analytics will change – instead of pivoting data, we will play what-if games on context and spatial relationships. Data visualization will move to the forefront of importance. Land-mapping of documents and their relevance to the business question being asked will become standard. The analytics will provide insight into term relationships, and possibly snippets of multiple documents that show some context to the terms being indexed or searched.
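Term relationships and context snippets can be approximated with very little code. The sketch below counts how many documents each pair of terms shares and pulls a short keyword-in-context snippet; the function names and sample documents are hypothetical, and a real setup would at least filter out common words such as "the".

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def cooccurrence(documents):
    """Count how many documents each pair of distinct terms shares."""
    pairs = Counter()
    for text in documents:
        terms = sorted(set(tokenize(text)))
        pairs.update(combinations(terms, 2))
    return pairs


def snippet(text, term, width=2):
    """Return the first occurrence of `term` with `width` words either side."""
    words = text.split()
    for i, word in enumerate(words):
        if word.lower().strip(".,;:") == term:
            return " ".join(words[max(0, i - width): i + width + 1])
    return None


docs = [
    "Customer disputes the invoice amount.",
    "Invoice sent to customer on Monday.",
    "Meeting notes from Monday.",
]
pairs = cooccurrence(docs)
context = snippet(docs[0], "invoice")

print(pairs[("customer", "invoice")])
print(context)
```

Pairs that recur across many documents are candidates for the term relationships an analyst would then explore visually.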
The “term exploration warehouse” will become the data mining center of the future – allowing the contextual analyst to play with the relationships, meanings and interpretation of the documents and e-mails. There will be more on this as we develop this series.
Conclusions
As stated, governance and compliance, along with metadata and business process flows, are
of central importance here. It is no longer good enough to match existing “machine / operational”
processing to business requirements (the task doesn’t go away, but its nature changes). DW 2.0
includes governance, compliance and metadata by nature. The importance of these tasks is forced
by access to unstructured source data. There are questions of ownership, ethics, quality,
understanding and, most of all, interpretation. What will this data be used for? Why is it
being collected? How will the business weed out personal information or “commentary filled with
emotion” from fact?
These are the true questions within a UDW, and as we move forward in this series of articles, we hope to uncover some basic starting points for organizations struggling to answer these questions. In the next article, we dive a little deeper into the technical nature of unstructured data and take a look at form versus function along with utilization and interpretation. For now, I welcome all feedback and comments you may have.