This article originally appeared on the BeyeNETWORK.
One of the most important potential uses of the data warehouse is to serve as a basis for statistical analysis. Statisticians have found that having an integrated, historical base of detailed data that was usually scrubbed was ideal for doing statistical analysis. Such a foundation served the needs of the professional statistician nicely, meaning that the statistician could spend his or her time as an analyst and not a data administrator.
Indeed, the data warehouse community thought that the needs of the statistician could be served directly from the data warehouse. Once we had built the data warehouse, we could just let the statistician analyze the data. While it is possible to let the statistician explore the data found in the data warehouse, there are many reasons why such a strategy is hardly optimal. Put simply, there are many more reasons for having a separate collection of data for the statistician than for having the statistician operate directly against the data warehouse.
Some of those reasons include:
- The need to freeze data being used in the statistical collection of data. Statistical analysis is different from any other kind of analysis occurring in the data warehouse centric corporate information factory. In any other place having the freshest data possible is desirable. But in doing statistical processing, it is desirable to occasionally freeze data. When the statistician is doing heuristic analysis, if fresh data is being pumped into the statistician’s collection of data, the statistician cannot tell whether new analytical results are functions of a change in algorithms or changes in data. For this reason, it is occasionally necessary to freeze data that is being statistically analyzed.
- Even with the most sophisticated operating system, when statistical analysis is done along side other data warehouse processing, there develops what can be termed “workload friction.” Workload friction is what happens when short-running queries are mixed with long-running queries. Statistical processing is almost always characterized by very long-running queries. There is an old saying that illustrates workload friction. How fast can a Porsche go? Now how fast can a Porsche go in rush hour traffic on the Los Angeles freeways? The answer is—a Porsche in rush hour can go only as fast as the vehicle immediately ahead of it. And if that vehicle is a loading van going 25 mph, then the Porsche can only go 25 mph. When heavy statistical processing is mixed with other data warehouse transactions, the entire system comes to a crawl.
- Not all data goes into the statisticians’ collection of data as it exists in the data warehouse. On occasion, there is the need to edit or recalculate data. When this is done the statistician makes sure that everybody is using the same data. Without the creation of these convenience fields, the calculation will have to be done by everybody that uses the data at its most detailed level. By creating a separate source, the statistician can ensure that the level of data is exactly what the statistician needs it to be.
- By channeling data in a separate environment, the statistician has complete control over that environment. The statistician does not have this control when sharing the data in the data warehouse with other users with very different agendas.
- Finally, there is the issue of cost. The machine cycles for the data warehouse users are usually the most expensive machine cycles available. By removing data from the data warehouse, the statistician moves processing to an environment where the cost of processing is significantly less.
These are just a few of the more important reasons why removing statistical data from the data warehouse makes the most sense.
Bill Inmon is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.