Data warehouses and scalability

Is a data warehouse really necessary? What business requirements indicate the need for a data warehouse?

This article originally appeared on the BeyeNETWORK.

The rapid entrance of a number of data warehouse appliance solutions over the past few years has raised some interest in the question of data warehouse performance and scalability. While it is valuable to explore the nature of the performance and scalability issues that drive this segment of the market, there is equal value in evaluating your organization’s true performance and scalability requirements. In other words, it is important to clarify the orders of magnitude based on reasonable benchmarking of actual warehouse demand, as opposed to the I/O channel bandwidth requirements for efficient ETL and information delivery, or the analytical processing requirements that actually are executed at the client application. One might find that in some situations, there might even be a question as to whether a data warehouse is necessary at all! In fact, this question came up in a coffee-break conversation at the most recent TDWI World Conference (San Diego in August of 2007), where a handful of folks were discussing what characterized the business requirements that indicated a need for a data warehouse.

For a moment, though, I want to reflect on the type of work I performed early in my career, which was developing compiler optimization techniques to support high performance, massively parallel computing. Typical compiler optimizations focus on reducing the load of the critical resource, which in the early days was the CPU; reducing the number and complexity of computation was the significant driver, as the amount of time the CPU took to perform additions and multiplications overwhelmed all other aspects of the computing model.

But as CPUs increased in speed and capabilities, both in complexity (e.g., VLIW) as well as simplicity (e.g., RISC), the overwhelming performance factor transitioned away from CPU cycles because they increased in performance to the point where the processor often sat idle waiting for data to be delivered from memory. Witness the development and integration of multiple levels of cache, increase in the number of registers, and the hardware capabilities to stream data in through the memory hierarchy. At that point, the focus of optimization changed. Instead of looking at the performance of the “core,” the need for optimization shifted to the channels, and looking at ways to increase the speed of moving data from its static location into the CPU at a rate that would not allow the processor to delay executing. Of course, with multiple processors touching the same memory resources, the semantics of the compiled program needed to be observed to ensure that race conditions are resolved consistently and read/write dependencies are not violated.

This boils down into two concepts. The first is that no matter how fast your CPU was, its performance was bound by the speed at which data is moved. The second is that consistent and correct execution depended on ensuring coherence of the data across the environment. Now, fast forward 20 years, and this exact situation is analogous to what is occurring in two critical business areas in the data warehousing space. There is a growing number of data warehouse appliance vendors selling solutions that promise high performance data warehousing on top of parallel architectures. The appliance is similar to the CPU, though – it has become the core resource that has been improved in capability, but its results are still dependent on the ability to gain access to the data that is to be warehoused.

So when assessing requirements for evaluating performance and scalability of a data warehouse solution, it will rarer that the key factor is based on the amount of data that will reside in the warehouse, but rather on which type of warehouse architecture best supports the information flow models reflecting how transactional data is transformed into reporting data. On the incoming side, this will probably refer to the number and size of transaction records that are used to support the reporting and analysis needs for each of the contributing business units. On the client side, these requirements are more likely to be based on the number, types, frequency, and complexity of the queries that will feed either the standard set of periodic (e.g., daily or hourly) reports, as well as the ad hoc queries likely to be launched by the business analyst constituency.

In fact, there are evolving solutions that have emerged to begin to address the data dependencies of a warehouse. The clearest example is the concept of an operational data store (ODS), which is a more volatile data collection that supports the currency and timeliness needs of collecting data from subject-oriented operational systems as both a query-able resource and as an interim location for the data that is to be subsequently loaded into a data warehouse. In other words, there are ways today that an ODS is akin to a memory cache, through which the latency associated with streaming aggregations of transactional data into the warehouse is essentially reduced. In addition, line of business reporting and queries can bypass the warehouse altogether and go directly to the ODS.

Another emerging notion is the concept of an ETL appliance – a high performance system designed specifically to support the data extraction, transformation, cleansing, consolidation, and integration needs prior to moving data into the warehouse. This idea pushes the performance question out a little further from the warehouse since, again, it concentrates the computational componentry where it is needed to facilitate data motion. Similarly, this requirement is reflected in the parallel software infrastructures that are de rigueur these days as part of most production-ready ETL products as well as the enterprise information integration (EII) componentry and services that effectively provide virtual cache coherence across the enterprise.

On the analysis side, we see similar innovations, such as streaming through multiple data channels for report and query result delivery, embedded query optimization tools that sit between the business analysis teams and the warehouse, as well as caching the results of common queries to reduce warehouse stress. The results of business analytics can be integrated into the warehouse, yet the intelligence embedded within the operational applications that are driven by these analytics results are more effectively deployed at the operational site, so relying on the performance of a warehouse for real-time analytics is probably not necessary either.

Providing a direct bypass of the warehouse to the ODS for line-of-business reporting recalls the question I brought up earlier: If we have ways to efficiently extract, integrate, and stream data from the original source directly into the standard reporting framework, is a data warehouse really necessary? In some of these situations, the warehouse becomes the bottleneck, thereby becoming more of a performance “red herring” than anything else.

The answer to this brings us full circle, but adds in a different dimension that is driven by business needs, not bulk extracts and transform/loads. If the business productivity and key operational and performance metrics in an organization are aligned across multiple lines of business, then an enterprise data warehouse becomes the repository for the performance models that reflect cross-organizational interactions and productivity metrics. In this case, enterprise analytics must be driven from a shared resource, of which a data warehouse is the natural selection. On the other hand, if the organization’s performance is measured within each line of business, then perhaps individual data marts may be more appropriate, which, in turn, reduces the complexity and subsequently the scalability needs for an enterprise-class warehouse.

David Loshin

David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of The Practitioner's Guide to Data Quality Improvement, Master Data Management, Enterprise Knowledge Management:The Data Quality Approach and Business Intelligence: The Savvy Manager's Guide. He is a frequent speaker on maximizing the value of information.

Dig Deeper on Data warehouse software