Data warehousing between a rock and a hard place

This article takes a skeptical position: data warehousing, e-discovery (email) and enterprise content (document) management will not converge.

This article originally appeared on the BeyeNETWORK.

The challenge of managing, integrating, and getting value from structured and unstructured data continues to grow almost as rapidly as the data volumes. This challenge is driven by regulatory requirements of e-discovery, compliance, fraud detection, and governance; the business requirements of building and sustaining customer relations, brand development, and market trend analysis, in good markets and less good ones; and the infrastructure requirements of keeping the lights on in the information utility by archiving and restoring lost or damaged files.

This article takes a skeptical position: data warehousing, e-discovery (email), and enterprise content (document) management will not converge. These will remain separate markets for the foreseeable future, but with an increasingly broad and deep overlap of infrastructure and application functionality that embraces structured and unstructured data, especially email and the structured transactions that correspond to the messages.

There is such a substantial installed base of data warehousing that bridging the gap between transactional data and archiving, document management, email, and e-discovery will be a requirement for the foreseeable future. The same thing can be said about the giant installed base of document management systems. There are numerous, separate archiving systems targeting different data sources – email, documents, transactions – out there too. Meanwhile, data warehousing appliances are creating new, proprietary silos all the time. But let's not go there right now – just another data point that any convergence will be incomplete – a broad and deep overlap.

However, even if these markets do not converge in the simple sense that one or more of them disappears into the other, the set of requirements that would have to be addressed by a single data store is an impressive but not impossibly high bar to get over: email archive, retention, compliance, backup, search; file archive, retention, compliance, backup, search; data warehousing storage, retention, compliance, backup, search.

The three keys to connect and make intelligible the data from the three different sources are:

  1. Extreme scalability to handle the data volumes – this is where a column-oriented database would come in handy since the storage compaction is intrinsic and prior to the additional compression that could be applied;

  2. Parallel, high-performance ETL functionality to load all the data; and finally

  3. Search capabilities that enable high-performance inquiries against the data.

Such unified access to diverse data types, intelligently connected by metadata, is also sometimes described as a “data mashup.”
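To make the first point concrete, here is a minimal sketch of why column orientation tends to compact data before any general-purpose compression is applied. It uses run-length encoding, one generic columnar technique; it is not a description of how Sybase IQ or any other product stores data, and the trade records are invented for illustration.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical trade records (ticker, side, shares), sorted by ticker.
rows = [
    ("IBM", "BUY", 100),
    ("IBM", "BUY", 250),
    ("IBM", "SELL", 100),
    ("ORCL", "BUY", 500),
    ("ORCL", "BUY", 75),
]

# Column orientation: each attribute is stored contiguously, so repeated
# values sit next to each other and can be collapsed.
ticker_column = [row[0] for row in rows]
side_column = [row[1] for row in rows]

encoded_ticker = run_length_encode(ticker_column)
encoded_side = run_length_encode(side_column)

print(encoded_ticker)  # [('IBM', 3), ('ORCL', 2)]
print(encoded_side)    # [('BUY', 2), ('SELL', 1), ('BUY', 2)]
```

In a row-oriented layout the same values are interleaved with the other fields of each record, so runs like these rarely form.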

Such a system is technologically feasible. It perhaps needs to be better known that Sybase IQ/Sun and BMMsoft have put together a product delivering functionality across these three previously unrelated silos – data warehousing, e-discovery (email), and document management – and that it can be purchased as EDMT Server, a single part number from BMMsoft (EDMT stands for “Email, Documents, Media, Transactions”). The business need is real. Is the product? This is not a rhetorical question. Products and vendors will be judged by the quality and frequency of their customer references. Always make it a point to ask for references and check them out in detail.

This is not a question that can be answered here. Here the purpose is to use the very existence of the offering – that someone saw a market opportunity – to drive a conversation about the market. However, if it functions as designed and gets traction in the market, it could be the lead pin for consolidating a diversity of point products. For example, risk management and compliance offerings include AXENTIS, BWise, Cura, Protiviti, Compliance 360 and IBM, which has at least two offerings – one based on Lotus Notes and one based on FileNet. Email archiving is a separate market with separate products. These include Autonomy’s Zantaz, EMC EmailXtender, HP Integrated Archive Platform, OpenText Livelink, and Symantec Enterprise Vault. Document management systems include IBM FileNet Business Process Manager, EMC Documentum, OpenText Livelink ECM and Autonomy Cardiff Liquid Office.

If an enterprise needs workflow, then it will continue to require a special purpose document management system. FileNet, which invented workflow in 1985 and was acquired by IBM in 2006, continues to lead the pack, though plenty of time has elapsed to allow the functionality to be reverse-engineered. If the enterprise requires a rules engine for compliance and governance, then it will need a compliance, risk management and governance system.

Notwithstanding the rich comic possibilities of a Guinness Book of World Records entry for a single image of a petabyte data store using Sybase IQ on Sun servers and storage with BMMsoft ETL, metadata, and search, there are a couple of data points worth noting. This certification was the foundation for the third version of the Sun-IQ-BMMsoft “Reference Architecture,” the official sizing and configuration tool used by Sun. It was audited by InfoSizing, whose Francois Raab signs the TPC benchmarks. The petabyte (1,000+ TB) was condensed down to some 260 terabytes, thanks to the single instancing, deduping and pre-compression of EDMT Server, the intrinsic data compression of the column orientation of Sybase IQ, and additional compression algorithms. This gives support to the contention that as systems climb the path of extreme scalability, storage technology becomes an increasingly large proportion of the cost and, by inference, an opportunity for extreme savings if column-oriented technology is able to handle it, as is the case here. The idea of a unified database (UDB) capable of accommodating all digital data in a single data store still remains a high bar, but the bar is being lowered.
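The storage figures above translate into a simple ratio. The sketch below takes the two round numbers quoted in the text (about 1,000 TB raw, some 260 TB stored) as inputs; both are approximations, so the result is indicative only, and the function name is ours.

```python
def compression_summary(raw_tb, stored_tb):
    """Return (compression ratio, fraction of raw storage saved)."""
    ratio = raw_tb / stored_tb
    saved_fraction = 1 - stored_tb / raw_tb
    return ratio, saved_fraction


# Round figures from the text: ~1,000 TB raw, ~260 TB after reduction.
ratio, saved = compression_summary(1000, 260)
print(f"{ratio:.2f}:1 compression, {saved:.0%} of raw storage saved")
```

On these figures the reduction is a little under 4:1, or roughly three quarters of the raw storage avoided.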

The argument made by BMMsoft is that e-discovery tools miss the relevant transactional data – e.g., stock trades – required to detect fraud; whereas SQL tools miss the relevant emails and documents. Thanks to the metadata layer, the EDMT (email, documents, media, transactions) tools search all the relevant dimensions, enabling the front-end visualizer to display the correlations, connections, outliers, and trends needed to detect fraud, identify a developing security threat, build customer relations in context, or provide a 360-degree view of a hospital patient.
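The idea of a metadata layer that spans both kinds of data can be sketched in a few lines. This is purely illustrative and assumes nothing about how EDMT Server works: the `Email` and `Trade` records, the `correlate` function, and the matching rule (same person, ticker mentioned in the subject, within one day) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Email:
    sender: str
    sent_on: date
    subject: str


@dataclass(frozen=True)
class Trade:
    trader: str
    traded_on: date
    ticker: str
    shares: int


def correlate(emails, trades, window_days=1):
    """Pair each trade with emails from the same person that mention
    the traded ticker within `window_days` of the trade date."""
    window = timedelta(days=window_days)
    matches = []
    for trade in trades:
        for email in emails:
            same_person = email.sender == trade.trader
            close_in_time = abs(email.sent_on - trade.traded_on) <= window
            mentions_ticker = trade.ticker.lower() in email.subject.lower()
            if same_person and close_in_time and mentions_ticker:
                matches.append((trade, email))
    return matches


# Invented sample data.
emails = [
    Email("alice", date(2009, 1, 14), "Heads up on ACME before earnings"),
    Email("bob", date(2009, 1, 14), "Lunch on Friday?"),
]
trades = [
    Trade("alice", date(2009, 1, 15), "ACME", 10_000),
    Trade("bob", date(2009, 1, 15), "ACME", 200),
]

matches = correlate(emails, trades)
for trade, email in matches:
    print(trade.trader, trade.ticker, trade.shares, "<->", email.subject)
```

A SQL query over the trades alone would see both trades, and an email search alone would see both messages; only the combined view singles out the one trade preceded by a related message.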

Application innovation will be needed. With the benefit of 20/20 hindsight, it is clear that Basel II and SOX did not work, at least to the extent that the systemic risks of risky (“toxic”) financial instruments, including risky mortgages, were not visible to the compliance data warehouse and so were not effectively guarded against. The generals usually fight the last war. Pricewaterhouse claims to have followed the letter of the law of generally accepted accounting principles in auditing Satyam, though it is hard to fathom how it could have overlooked the missing billion dollars in cash. Price is not like the three-person accounting firm that is supposed to have audited Bernard Madoff's firm. Yet by all reports, a major integrity outage – data, legal, ethical, and business integrity – occurred and executives have now been arrested by the Indian authorities. One obvious take-away? This will likely drive even more requirements for storage as attempts are made, whether wisely or not, to "store the universe" against a repetition of events such as those we have seen since September 2008. The recommendation? Get a big one – or, more properly, get one that meets current needs only but that scales up to an order of magnitude beyond what you think you need right now.

Standard relational databases have acquired the capabilities to handle structured and unstructured (including semi-structured) data, with XML as the form for the unstructured. DB2 has had such capabilities as part of a single data store since DB2 9.5, with front-end technology leapfrogging ahead – e.g., IBM OmniFind Analytics Edition. Oracle followed suit shortly thereafter, though with a different implementation approach. As is often the case, the technology is out ahead of the applications.

Most innovative technologies start at the margin and below the dominant design of the market leaders (i.e., standard relational databases), and through superior implementation, marketing, delivery and execution, work their way toward the center. Make no mistake – because of innovations in metadata, ETL, and front-end search, such an approach has the potential to execute a pincers movement on data warehousing, document management, email archiving and e-discovery, squeezing the existing market in a positive sense to deliver superior value to customers at increased functionality and reduced prices.

At the risk of mixing the metaphor with which this article began, what happens when an irresistible force meets an immovable object? We are about to find out. The irresistible force of e-discovery, compliance, fraud detection, governance, risk management, and other regulatory mandates is heading straight toward the immovable rock of year-to-year 10% reductions in information technology budgets. In a conversation with me on January 15, 2009, Tom Pacileo of Compass Management Consultants quoted the anonymous client CEO who told him, “We are not dependent on IT. Look, I can go right over here to the printer and produce a hard copy.” Although most CEOs are more sophisticated than that – and keeping the printers running is still one of the keys to IT customer service – this is evidence that the business and IT are still struggling to handle the basics of communication, collaboration, managing costs, and generating competitive advantage. Pacileo goes on to make the point that the current economic dynamics have produced the biggest IT crisis since the Y2K event and that, though all the details are different, it can be leveraged to harness application consolidation, simplification, and improvements. Instead of year-to-year budget cuts, smart companies are identifying projects that generate enough savings that they can pay for themselves – in effect self-funding. And that is the answer to the question – the firm caught between the irresistible force and the immovable object should look for self-funding IT projects. The number of firms with multiple, overlapping systems – data warehousing, ERP, CRM, document management, email, not to mention payroll, HR, and order entry – is significant enough to take another look at working smarter while downsizing.

Lou Agosta is an independent industry analyst, specializing in data warehousing, data mining and data quality. A former industry analyst at Giga Information Group, Agosta has published extensively on industry trends in data warehousing, business and information technology. He is currently focusing on the challenge of transforming America’s healthcare system using information technology (HIT). He can be reached at

Editor's Note: More articles, resources, news and events are available in Lou's BeyeNETWORK Expert Channel. Be sure to visit today!

