The original intent of the data warehouse was to segregate analytical operations from mainframe transaction processing in order to avoid slowdowns in transaction response times, and minimize the increased CPU costs accrued by running ad hoc queries and creating and distributing reports. Over time, the enterprise data warehouse became a core component of information architectures, and it's now rare to find a mature business that doesn't employ some form of an EDW or a collection of smaller data marts to support business intelligence, reporting and analytics applications.
But as organizations increasingly adopt newer technologies -- Hadoop clusters, NoSQL databases, columnar and in-memory databases, data virtualization tools -- questions are being raised about the future relevance of data warehouse software in enterprise IT infrastructures. Some people have already started to sound the death knell for the EDW, predicting its impending demise at the hands of big data systems and high-performance computing platforms.
And those other technologies do offer some advantages over the traditional data warehouse. Hadoop is a distributed processing framework that promises high levels of performance scalability using low-cost commodity hardware. In-memory databases and columnar software geared to analytical uses can also dramatically increase processing performance. NoSQL databases bypass the schema strictures of mainstream relational database management systems and provide wider flexibility in developing applications. Layering a data virtualization tool on top of systems enables on-the-fly integration and in some cases also allows transaction processing and analytical applications to simultaneously touch the same data sets; both of those capabilities can reduce the need to extract and load data into a segregated warehouse.
Look under the covers on IT costs
Yet the reports of the death of the data warehouse may prove to be greatly exaggerated. From a financial perspective, the motivations to migrate to new technologies must be balanced against the merits of continuing to leverage existing investments in EDW technology that's already in production use -- and still producing the data goods. It's also worth pointing out that realizing the perceived value of a radical change sometimes requires a greater investment than originally anticipated.
As an example, consider infrastructure costs. There's an implication that downloading and installing open source software such as Hadoop on a homegrown setup of interconnected commodity computing systems provides a low-cost alternative to the high-end servers or mainframes that typically host data warehouses. While it's possible to create a test-bed environment using that approach, it takes more for a Hadoop cluster to deliver on its performance promises in production applications: An organization must invest not only in new technology but also in skilled staff resources to deploy and manage the platform.
Hadoop's potential for storage elasticity also suggests potentially unlimited disk space. But it isn't always smooth sailing on the Hadoop data lake. Realistically, the availability of a seemingly inexhaustible amount of storage may encourage users to save data unnecessarily, rapidly filling the available disk space with a broad array of unstructured (and ungoverned) data that may not have any real business value.
A blended approach to managing data
There are some other key factors to recognize:
- Organizations that have invested significant amounts of money and effort in their data warehouse environment would need to see a sizable ROI projection for a Hadoop or NoSQL deployment before deciding to completely rip out the EDW and replace it.
- Because of the nature of open source development, technologies like Hadoop and the various tools surrounding it still have some time to go before they reach the level of maturity that data warehouse software has attained -- if they ever get there.
- Even though some components of the Hadoop ecosystem are intended to replicate the dimensional schemas and interactive analytical queries supported by data warehouses, Hadoop itself remains largely batch-oriented for the near term.
- Many business users are still dependent on the reports and ad hoc query capabilities of their trusted data warehouses.
Of course, you can't ignore the availability of a parallel processing platform that can run complex computational algorithms to analyze massive volumes of data in ways that can't be done using a system geared to dimensional slicing and dicing. The results of those kinds of analytics applications can be used to augment the data in an enterprise data warehouse, enhancing customer profiles and enabling more informed business decisions to be made.
That suggests that while Hadoop, NoSQL and other alternative technologies are likely to emerge as significant components of BI and analytics architectures, the optimal strategy will blend them with the EDW. It isn't time to close the door on the data warehouse just yet.
About the author:
David Loshin is president of Knowledge Integrity Inc., a consulting and development services company that works with clients on big data, business intelligence and data management projects. He also is the author or co-author of various books, including Using Information to Develop a Culture of Customer Centricity. Email him at firstname.lastname@example.org.