This article originally appeared on the BeyeNETWORK.
So you have built a data warehouse. You have your data marts that you feed. You have your users that regularly access the data warehouse. You have the occasional analyst looking for an obscure scrap of data. Your warehouse started out reasonably sized, and now it is starting to bulge at the seams. Life is good – and reasonably normal.
Then one day a statistician comes knocking on your door. The statistician is excited about all of the data you have and especially excited about the fact that the data is integrated across multiple sources and that the data has been scrubbed. Throw in the fact that the data exists at the lowest level of granularity, and the statistician is absolutely exultant.
So the statistician goes back into a huddle and designs an analysis. Then the statistician submits the analysis on Friday of a three-day weekend. To do the analysis, the statistician has looked at a broad sweep of detailed data – five years’ worth of sales, perhaps. On Tuesday, the statistician comes in and looks at the results. The statistician is really excited. The statistician states, “We need to include Canadian and Mexican sales in the results to get a much more accurate reading. The first analysis was good, but we can do better.”
So the statistician submits a job to run the next weekend. The only problem is that a special ETL process has been designated to run that weekend. The statistician slinks away to a corner and waits yet another week.
The subsequent weekend rolls around. The statistician is ready to roll. But, on Friday afternoon, an error is found in a database load, and an entire weeks’ worth of data has to be reprocessed. The statistician understands, but is a little more than angry. However, the statistician impatiently waits his turn.
Finally, the weekend rolls around, and it is the statistician's turn. The analysis is submitted. The statistician comes in on Monday, and the results are really good. However, in discussing the results with the marketing organization, it is decided that a different type of sale is needed. The statistician quickly adjusts the parameters and is ready to resubmit.
The problem is that the database administrator tells the statistician that the earliest that the next analytical run can be scheduled is a month from the current date.
The statistician and the marketing department are really miffed. They claim that the bottom line of the business is being impacted by the inability to run statistical analysis as desired. The database administrator points out that if the large statistical analysis is submitted in the middle of the data warehouse day, that service will cease to be acceptable for all the regular users of the data warehouse.
Who is right and how can this circumstance be resolved?
The answer is that both the statistician and the database administrator are right. But, there is a solution.
The Exploration Warehouse
Enter the exploration warehouse. The exploration warehouse is a physically separate data warehouse from the enterprise data warehouse. The primary objective of the exploration warehouse is to provide a place where the statistician can do his/her processing as frequently as the statistician likes, with no regard for the regular users of the data warehouse. Because the exploration warehouse is necessarily physically separate from the enterprise data warehouse, there can be no performance conflict between the two environments.
There are other good reasons for the separation of the exploration warehouse and the enterprise warehouse. One of these reasons is the need to freeze data in an exploration warehouse. In an enterprise warehouse, data is normally entered as soon as it is ready. In doing so, the enterprise warehouse contains the freshest data possible. However, in the exploration warehouse, it is often necessary to freeze the data. This is because when one iteration of processing is done and is compared with the results from a previous analysis, the data needs to be frozen. If the data is not frozen, then the results of one iteration of analysis may be due to a difference in data, not a difference in algorithms or analysis.