XQuery and XML data: DB2 helps manage the era of unstructured data

The arrival of IBM XQuery and XML data is simplifying users' decisions in this era of unstructured data, just as the arrival of the relational database and SQL simplified IT buyers' decisions twenty years ago.

The information mix of today's typical large enterprise is strikingly different from that of even five years ago -- and so IT should consider new approaches to accessing that data.

Five years ago, OLTP and data warehousing -- the key mission-critical data-management tasks of the typical large enterprise -- at least half the time involved "dynamic" structured (relational) data that, once stored, was frequently changed. Although data warehouse data changed less frequently, with daily or weekly refreshes, it reflected the OLTP data, which was updated much more frequently.

Today, static event data ("write once read many," or once created, not modified), much of it text or rich-media, is driving today's tremendous growth in data volume. As a result, up to 85% of today's data is static and semi-structured or unstructured.

The majority of today's data management systems use SQL (a transactional language designed for relational data) and are "pure" relational databases, which were originally designed for, and are still optimized for, dynamic, structured data. This transactional language and these relational databases are not the best solution for static, semi-structured or unstructured data.

Instead, users need databases that store semi-structured and unstructured data as XML messages, and access it via the XQuery transactional language. XML is today's de-facto standard for storing semi-structured and unstructured static data, and XQuery is becoming the de-facto standard for accessing XML data. This effectively partitions the enterprise's data into dynamic/structured and static/non-structured data, and applies the data-management technologies appropriate to each. By handling the "information mix" in this way, IT can achieve major benefits for the enterprise:

  • Increased overall performance and scalability will lead to greater user satisfaction, and hence better bottom-line results (e.g., more sales).
  • Increased scalability using existing resources will cut IT costs.
  • Separating management of static/non-structured from dynamic/structured data will put users' data management more in line with today's storage management systems, yielding smaller backup windows, higher I/O performance, and more rapid disaster recovery.

Data Types in the Information Mix

There are several useful ways to categorize today's data. The first is by the format of the data:
  • Structured data is numeric or with a simple, efficient alphanumeric structure. OLTP and data-warehouse applications typically involve structured data. Most data used by relational databases is highly structured.
  • Unstructured data consists of a large piece of data with no internal structure (e.g., a movie), or a highly complex combination of simple data types (e.g., a car designer's diagram). Web graphics is unstructured data. Object-oriented or XML databases often handle unstructured data.
  • Semi-structured data (e.g., text) is intermediate between structured and unstructured data. Many word-processing "flat files" are semi-structured data. Both relational and other types of databases handle semi-structured data. Unstructured plus semi-structured data comprise "non-structured" data.
Another way of categorizing data is by its age:
  • Current data often needs to be accessed rapidly (e.g., for the needs of the real-time enterprise or for recovery from disasters), and is stored as close to main memory as possible.
  • Aging data (e.g., e-mail that is more than three days and less than one year old) may need to be accessed as part of large queries, but is more rarely used; and it is often stored on disk.
  • Archived data is usually more than one year old. It is rarely accessed except for business-compliance needs such as Sarbanes-Oxley.
A third way of categorizing data is by its rate of change:
  • Dynamic data frequently changes (e.g., customer records).
  • Static data, once stored, never changes (e.g., log files and records of events).

The Changing Mix of Static/Non-Structured and Dynamic/Structured Data Over Time

The First Era of information (until the late 1970s) typically involved small amounts of primarily textual (semi-structured) static data, stored in flat files. The file-handling systems of this Era focused on accessing flat-file data. Data management systems such as IMS and VSAM were scalable (although with much effort) to handle high transaction rates (e.g, Fastpath) on static data, but were less effective for dynamic data and large data volumes.

In the Second Era of information, enterprises discovered that they could gain competitive advantage by effective order entry and ERP (Enterprise Resource Planning). These systems involved a constant stream of dynamic, structured data -- such as airline reservations -- that was much larger than the flat-file data of the First Era. Relational databases such as IBM DB2 (highly effective at accessing and updating large volumes of dynamic data) took over the job.

In the Third Era of information, data warehousing and Business Intelligence siphoned data from OLTP systems -- customer buying records, inventory and delivery management, airline reservations, and similar types of data -- and used it to gain competitive-advantage insights via data mining. Data warehousing therefore began by downloading OLTP systems' dynamic, structured data -- although when this data is in a data warehouse the data does not change until a daily or weekly "refresh".

In today's Fourth Era of information, OLTP and data-warehouse systems take an increasing proportion of their data from applications and automated systems rather than users, and those applications feed them data that has become predominantly semi-structured or unstructured and static -- telecommunications network records, materials tracking using bar-code data, automated e-commerce systems for online buying, health care records including X-rays, and so on.

By some estimates, as much as 90 % of today's OLTP and data warehouse data is static and as much as 85 % of all data is unstructured (also, most relational data is structured, almost all relational data is dynamic, most non-structured data is static, and most static data is non-structured). The result is a large and increasing gap between the potential ability of data management systems to handle and scale with the new "information mix" and their actual performance, exacerbated by rapid growth in the total amount of data stored in mission-critical OLTP and data-warehouse databases.

The Value of XQuery

Both SQL and XQuery are similar transactional languages, although they have very different pedigrees. SQL grew out of the formal mathematical definition of relational databases, and became the de-facto standard for transactions on relational databases especially because it allows concise queries that require little bandwidth when sent over networks. XQuery grew out of efforts to access Web documents stored first in XSL and then in XML (eXtended Manipulation Language) format; it is an ongoing standards effort by the W3C (World Wide Web Consortium).

Where SQL is designed for, and allows only, querying against relational data, XQuery allows SQL-like querying and updating against disparate data types. XQuery also permits data to be transformed into a common XML format during querying, as well as result formats to be generated suitable for applications and end users. Therefore, XQuery is especially appropriate for queries against rich-media and file as well as relational data, or to access data in XML format in an operational data store (ODS).

The last two years have seen widespread adoption of XQuery, first in EII (Enterprise Information Integration) tools such as IBM Information Integrator, and then in XML databases such as Ipedo. Now, enterprise databases are adding XQuery to their capabilities as they allow non-structured data into their data stores.

XQuery is thus the "wave of the future" for static/non-structured data. It typically performs better than SQL on these types of data and is especially well-suited to handling Web data.

What Storage Vendors Are Doing

Database technology is not the only area in which there is a clear focus on handling the new information mix. A clear indication of storage vendors' key role in implementing new technologies for the new information mix is the "5-pool approach" pioneered by storage vendors such as Maxtor. In place of the traditional "disk, tape, offline tape" storage hierarchy, these vendors would put a "nearline disk, midline disk, nearline tape, midline tape, offline tape" hierarchy. Midline disk and midline tape would typically contain static data, allowing key dynamic data to be accessed and backed up more rapidly, while saving money on the midline disk and tape.

The 5-pool approach is also part of a broader effort on the part of most major storage vendors such as IBM and EMC, called Information Lifecycle Management (ILM). This seeks to develop mechanisms to place data in the correct "pools" and move it from pool to pool as the data ages, as part of the "lifecycle" of a particular piece of data (note that even structured data, if preserved, eventually becomes static and must be archived). The 5-pool approach and ILM should allow users to separate dynamic/structured data from static/non-structured data much more readily, as well as improve data archival.

How DB2 9 Fits Well With XML and XQuery

IBM's announcement on June 8, 2006 of the forthcoming launch of DB2 9 (originally code-named "Viper") is in many respects today's typical database vendor announcement — that is, there is an extensive list of highly technical added features, often aimed at the needs of particular users with particular transaction patterns.

However, the sum of these features signals a marriage of today's relational mission-critical data with the oncoming world of data formatted as XML messages and accessed by XQuery. DB2 9 also includes extensive performance improvements for relational data. This, in turn, means that DB2 9 can handle both dynamic/structured and static/unstructured data with unprecedented performance.

DB2 Before Viper

DB2 UDB 8.2 is a full enterprise relational database, including support for two-phase commit, SMP (symmetric multiprocessing), clustering, Extenders to support rich-media and complex data types, and remote/GUI (graphical user interface) database administration utilities. For connectivity to other data sources, DB2 UDB provides ODBC, JDBC, SQLJ, OLE DB, and X/Open CLI APIs. For programmers, DB2 UDB offers a visual "stored-procedure builder" for Java or SQL stored procedures, client-side code, and user-defined functions.

In accordance with IBM's "autonomic computing" initiative, DB2 UDB has automated its administrative functions extensively, including a Design Advisor for design time database optimization, automatic statistics collection for automated query performance improvement, a Configuration Advisor for tuning the database at installation, and a Health Monitor for overall DB2 UDB administration. DB2 UDB includes extensive Web-service support, plus the ability to show itself to outside programs via SOAP (Simple Object Access Protocol) as either a Web service provider or consumer.

Buyers of DB2 UDB (the Linux/UNIX/Windows version) can choose from several editions: DB2 Personal Developer's Edition, Data Warehouse Editions (including data-warehouse administration, query management, data mining, and analytic features such as DB2 Cube Views), Universal Developer's Edition, Personal Edition, Workgroup Server Edition, Workgroup Server Unlimited Edition, Enterprise Server Edition, Express Edition (part of IBM's medium-sized business initiative, with enhanced automation of installation and administration), and DB2 UDB Express-C (an open-source version of DB2 UDB Express Edition without replication and Extenders).

IBM also offers DB2 Everyplace for mobile devices, including synchronization with enterprise data sources. In contrast to IBM's major database competitors, IBM's Editions share functionality, but scale differently, with the buyer specifying the degree of maximum scaling via the Edition's licensing terms. DB2 UDB also offers a range of tools and utilities for application development using DB2 Extenders (built-in functions that "extend" DB2 UDB for key functions such as spatial analysis), and an attractive suite of data administration, performance, application management, and recovery and replication tools.

Complementary products and options include WebSphere Information Integrator 8.2 (which can handle distributed and heterogeneous data SQL joins, reads, selects, and single-site updates), data replication software, DB2 Data Links Manager to manage external files, DB2 Net Search Extender to allow high-speed text search via in-memory searching, and QMF, a decision-support tool supporting querying, charting, and reporting. DB2 Warehouse Manager (including DB2 Query Patroller) allows management of large data warehouses, and DB2 Intelligent Miner Modeling, Visualization and Scoring provide data-warehouse data mining. DB2 Spatial and Geodetic Extenders provide geospatial data analysis; DB2 Connect allows DB2 UDB to interoperate with other databases across platforms; WebSphere Studio and IBM's Rational portfolio provide powerful application development tooling, as does Eclipse, a widely-used open-source development "framework"; and WebSphere MQ provides messaging and assists in event publishing.

DB2 UDB's scalability prowess makes it especially appropriate in large-scale OLTP, data warehousing, and operational data stores that perform both queries and updates. DB2 UDB is also highly useful where IT buyers require IBM platforms (or connectivity to them), where major projects require sophisticated vendor services, and where mainframes or i5 platforms form part of the solution. Because of IBM's cross-platform support, DB2 UDB is often useful in cross-platform situations.

IBM sells DB2 via its services arm (both for IT and business consulting) and its huge global sales force. In recent years, IBM has also emphasized and increased DB2 sales to ISVs and indirect channels. IT buyers examining DB2 UDB should also note its combination of a broad array of parameters that can be tuned for better performance and robustness in particular situations, plus its automation features that reduce the complexity of tuning and maintaining an enterprise database.

The New Features of DB2 9

IBM identifies XML and XQuery support as one of the key areas in which DB2 9 extends DB2.

Similar to Oracle, IBM enables coexistence of relational and XML data (IBM calls its approach pureXML):

  • XQuery transactions can be performed on XML and relational data.
  • SQL transactions can be performed on XML and relational data.

    What differentiates IBM from Oracle is that IBM makes a great effort to associate data-type-specific metadata and indexes with each set of XML data, rather than treating it as an undifferentiated mass of objects. This, in turn, allows IBM to optimize transactional performance for each type of XML data. Also, DB2 9's development support includes XML-data-access-code development support.

    Conclusions

    Just as the arrival of the relational database and SQL simplified IT buyers' decisions twenty years ago, the arrival of XQuery and XML data is simplifying users' decisions now. That is, users have a clear plan to follow:

    1. Separate static/non-structured and dynamic/structured data.
    2. Use a relational data store and SQL with dynamic/structured data, and an XML data store and XQuery with static/non-structured data.
    3. Use ILM to move structured/relational data to a static data store such as an archive as it ages.

    Moreover, the ascendance of XML and XQuery means, hopefully, that the object-relational database wars, which have been going on for 20 years now, are over, allowing both sides to declare victory. In fact, the customer is the real victor, as the same database can finally support both relational and unstructured/semi-structured data, almost equally effectively. This will be useful in areas such as master data management and operational data stores, where related business-critical relational data and rich-media XML data can exist cheek by jowl for rapid identification of relationships in the "on demand" business.

    The major long-run news in the DB2 9 announcement is DB2's new ability to support users following this plan, by marrying a relational and XML database. In other words, just as XQuery and XML have become de-facto standards for data federation and Enterprise Information Integration (EII), they are well on their way to becoming de-facto standards for handling non-relational data, and IBM is taking a leadership role in that effort with Information Integrator and DB2 9.

    About the Author

    Wayne Kernochan is President of Infostructure Associates, an affiliate of Valley View Ventures that aims to provide thought leadership and sound advice to both vendors and users of information technology. This document is the result of Infostructure Associates sponsored research. Infostructure Associates believes that its findings are objective and represent the best analysis available at the time of publication.

    This document is subject to copyright. No part of this publication may be reproduced by any method whatsoever without the prior written consent of Infostructure Associates. All trademarks are the property of their respective owners. While every care has been taken during the preparation of this document to ensure accurate information, the publishers cannot accept responsibility for any errors or omissions.

  • This was first published in June 2006

    Dig deeper on IBM DB2 management

    Pro+

    Features

    Enjoy the benefits of Pro+ membership, learn more and join.

    0 comments

    Oldest 

    Forgot Password?

    No problem! Submit your e-mail address below. We'll send you an email containing your password.

    Your password has been sent to:

    -ADS BY GOOGLE

    This Content Component encountered an error
    Close