This article originally appeared on the BeyeNETWORK
There is a recent trend among DBMS vendors to extend the data warehouse beyond its natural architectural boundaries. In particular, several DBMS vendors have:
- combined real-time online processing and data warehouse processing
- combined OLAP processing and data warehouse processing
- combined exploration and data mining processing and data warehouse processing.
The reasons for these different DBMS strategies are very understandable. The more a vendor can incorporate into a single data warehouse, the more hardware that vendor can sell. If I were a DBMS vendor, I would try to do the same thing. But there are some very good reasons why these attempts to extend the data warehouse are not destined for massive success.
The reasons why this attempt at architecturally extending the data warehouse does not have a bright future fall into two categories: technical and organizational.
Combining real-time online processing and data warehouse processing in the same box can be done, but only to a limited extent. If only a limited amount of processing is going on, then a limited amount of mixing can be done. But with a greater amount of processing and a complex workload, it is best to divide the environment into a data warehouse and an operational data store (ODS). The ODS is where real-time processing occurs.
What is wrong with mixing a lot of real-time processes with the data warehouse? The first difficulty lies in the workload. When real-time processing is mixed with the DSS processing that is typical of the data warehouse, the workload becomes a mixture of short-running and long-running processes. Once the workload becomes significant, consistent response time suffers, because short-running processes end up waiting behind long-running ones.
But there is another difficulty: real-time processing often comes with a desire to do real-time updates. Updates require the underlying infrastructure to guarantee transaction and data integrity, and the overhead of enforcing that integrity causes the rest of the data warehouse processing to suffer.
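The mixed-workload effect described above can be illustrated with a toy model. The sketch below assumes a single processor that serves requests strictly in arrival order; the arrival times, service times, and function names are hypothetical and chosen only to make the effect visible, not to represent any particular DBMS scheduler.

```python
def simulate_fifo(jobs):
    """Serve jobs one at a time in arrival order on a single processor.

    jobs: list of (arrival_time, service_time, kind) tuples.
    Returns a list of (kind, response_time) pairs, where response time
    is the completion time minus the arrival time.
    """
    clock = 0
    results = []
    for arrival, service, kind in sorted(jobs):
        start = max(clock, arrival)   # wait if the processor is busy
        clock = start + service
        results.append((kind, clock - arrival))
    return results


def worst_response(results, kind):
    """Longest response time seen for one kind of job."""
    return max(r for k, r in results if k == kind)


# Hypothetical workload: a short online transaction every 2 time units.
short_only = [(t, 1, "short") for t in range(0, 100, 2)]
# The same workload plus one long-running DSS query arriving at t=10.
mixed = short_only + [(10, 40, "long")]

print(worst_response(simulate_fifo(short_only), "short"))  # 1
print(worst_response(simulate_fifo(mixed), "short"))       # 40
```

In this toy run a single long-running query pushes the worst short-transaction response time from 1 time unit to 40. Real systems use more sophisticated scheduling than strict arrival order, so the numbers are only illustrative of the direction of the effect.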
Combining OLAP and data warehouse processing has some of the same technical difficulties. There is a mixed workload when the two worlds become entangled. But there is a more fundamental difficulty as well: the data structure that is optimal for the data warehouse is very different from, in many ways the opposite of, the data structure that is optimal for the multidimensional OLAP environment. Therein lies the major difficulty with trying to combine the two environments.
There is another related major difficulty: the OLAP environment is designed to serve a single community of users, while the data warehouse environment is designed to serve the broadest possible set of users, which is to say all users. It has long been known that optimizing the database design of the data warehouse for one set of users hurts everyone else.
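The structural difference can be sketched in a few lines. The example below assumes a tiny set of granular, warehouse-style rows and a hypothetical `build_cube` helper that pre-aggregates them along chosen dimensions; the data and names are invented for illustration and do not reflect any specific OLAP product.

```python
from collections import defaultdict

# Granular rows as a warehouse might hold them (hypothetical data):
# (date, region, product, customer, amount)
detail_rows = [
    ("2024-01-05", "East", "Widget", "C1", 100),
    ("2024-01-05", "West", "Widget", "C2", 150),
    ("2024-02-10", "East", "Gadget", "C1", 200),
    ("2024-02-11", "East", "Widget", "C3", 50),
]

REGION, PRODUCT, CUSTOMER = 1, 2, 3  # column positions in each row


def build_cube(rows, dims):
    """Pre-aggregate the amount column along the given dimension columns."""
    cube = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        cube[key] += row[-1]
    return dict(cube)


# One department wants sales by region and product...
sales_cube = build_cube(detail_rows, (REGION, PRODUCT))
# ...another wants totals by customer.
finance_cube = build_cube(detail_rows, (CUSTOMER,))

print(sales_cube[("East", "Widget")])  # 150
print(finance_cube[("C1",)])           # 300
```

Each cube answers its own department's questions quickly but has discarded the detail the other department needs. Both can be derived from the granular rows, whereas neither cube can be rebuilt from the other, which is one way to see why a structure tuned for one community serves the rest poorly.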
Combining exploration and data mining processing with a data warehouse can be done, but only to a point. Eventually, the resources required for data mining and data exploration are such that they overwhelm the processor on which they reside. There is an old way of expressing this relationship:
- if you want to do exploration processing on a data warehouse once a year, go ahead,
- if you want to do exploration processing on a data warehouse once a quarter, you shouldn’t have much of a problem,
- if you want to do exploration processing on a data warehouse once a month, you need to be very careful,
- if you want to do exploration processing on a data warehouse once a week, you better look elsewhere,
- if you want to do exploration processing on a data warehouse once a day, no way,
- if you want to do exploration processing on a data warehouse once an hour, spruce up your resume.
The resource contention problems become so severe in the extreme cases that there simply is no way to mix the two environments.
Furthermore, consider this: In many exploration cases, you do not want fresh data being loaded into your base of data. You often need to “freeze” the data so that one iteration of analysis can be compared with another without the underlying data having changed in between. In this case you CANNOT run an exploration project from a data warehouse.
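The need to freeze data can be shown with a minimal sketch. It assumes an in-memory stand-in for the warehouse and a hypothetical `freeze` helper that takes a deep copy; in practice the frozen copy would live in a separate exploration environment, but the principle is the same.

```python
import copy
from statistics import mean

# Hypothetical stand-in for a warehouse table that keeps receiving loads.
warehouse = {"orders": [120, 80, 100]}


def freeze(source):
    """Return an independent copy that later loads cannot change."""
    return copy.deepcopy(source)


frozen = freeze(warehouse)

first_pass = mean(frozen["orders"])   # first iteration of analysis

warehouse["orders"].append(500)       # fresh data arrives in the warehouse

second_pass = mean(frozen["orders"])  # second iteration, same foundation
live_result = mean(warehouse["orders"])

print(first_pass, second_pass)  # 100 100
print(live_result)              # 200
```

The two passes over the frozen copy agree, so any difference between iterations can be attributed to the analysis itself. Run against the live data, the second pass would have shifted simply because the foundation changed.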
There are, then, powerful reasons why a more distributed approach to data warehousing makes technological sense.
But there is another class of reasons why a data warehouse should not be combined with other architectural forms. These reasons stem from the budget and politics of the organization.
In most cases there is an organizational separation that mirrors the department structure. For example, the sales department has its OLAP infrastructure. The finance department has its OLAP structure. The marketing department has its OLAP structure, and so forth. Each of these organizations cherishes its autonomy. These organizations have their own hardware, software and databases. These organizations have their own analysts and programmers. The last thing these organizations want to do is relinquish control back to a centralized IT organization. For these reasons, these organizations do not want central control of anything, certainly not within their OLAP environment.
The same can be said of statistical analysis. The statisticians are a different breed. They are not just analysts; they are a specialized type of analyst. The way they think, the way they design their systems, and the way they do their analysis are very different from those of anyone else in the organization. The last thing these analysts want is to have a large monolithic organization controlling their infrastructure.
There are, then, some very good reasons why the data warehouse-centric corporate information factory (CIF), with its separate and distributed components, is the standard for information systems across the industry.
Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations. Bill can be reached at 303-681-6772.