This article originally appeared on the BeyeNETWORK.
Consider the peculiar case of historical data. If you go back to the days of OLTP systems, you find that historical data was very unpopular. The OLTP systems programmer did everything possible to minimize historical data. The systems programmer jettisoned historical data as fast as possible.
Why? They minimized historical data because it slowed down system performance and the OLTP systems programmer felt that it was his or her job to get maximum transaction performance at all costs. It is with this mindset that the OLTP systems programmer went about purging historical data as quickly as possible.
Now consider the data warehouse administrator. He or she loves historical data and thinks there is no such thing as too much. The data warehouse administrator keeps historical data and in doing so is positioned to please data analysis requests that come in on an almost random basis.
Therefore, the pendulum has swung from the OLTP systems programmer to the data warehouse administrator—from no historical data at all, to as much historical data as you can get.
The pendulum is about to swing again.
Is it true that a data warehouse should contain as much historical data as can be obtained? The answer is no. A data warehouse certainly contains historical data, but there comes a point where it is not proper to keep historical data in the data warehouse. That point comes when the probability of access of historical data drops to a very low number.
As a rule, the probability of access of historical data depends on its relevance. In other words, the newer data in the data warehouse has the highest probability of access and the older data has the lowest probability of access. Therefore, when data ages beyond five years or so, most organizations purge their data warehouse and put the purged data in archival or near line storage.
This strategy works fine for most organizations, but it is a crude strategy that can be refined.
Consider some cases where purging data by date does not work. Consider the case where there are columns of data that are never used, regardless of the date. In this case the data warehouse is storing data that is never used, which is a waste.
Or, consider the case where certain rows of data inside the data warehouse are never used regardless of the date. Suppose a bank has a data warehouse of customer data (a likely scenario). Now suppose that bank accounts from California are accessed frequently. People in California do a lot of banking business. Also suppose that people from Arizona and New Mexico rarely access their data in the data warehouse. The data is kept inside the data warehouse regardless of the state the person is from. And from a data warehouse standpoint, this is a wasteful thing to do.
So if the crude measurement of purging by date doesn’t seem to be working, what can a data warehouse administrator do? The answer is that there are data warehouse monitors that can tell you down to a gnat’s eyelash what data is being used and what is not being used. These monitors can tell what columns are in use, what rows are in use, what dates are being accessed, etc.
Ironically, the data warehouse monitors that work best are not those provided by the dbms vendors. For whatever reason, the dbms vendors have built their data warehouse monitors into their product. And the resulting overhead consumed by turning the dbms vendors monitor on is simply tremendous—so much so that most shops turn the monitor off at periods of peak processing. And of course, this is when you need the monitor turned on the most. So ironically, most dbms vendors monitors are self defeating. What works best are monitors from third-party vendors that do not soak up system resources.
The pendulum is being pushed back by the cost of the volumes of data that collect in a data warehouse over time ad infinitum. Those volumes of data are expensive and the volumes of data hurt performance.
Like all things in life the pendulum swings back and forth and it is time for the data warehouse historical data to push back (a little bit) toward the OLTP systems programmer.
Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations. Bill can be reached at 303-681-6772.
Editor's Note: More articles, resources and events are available in Bill's BeyeNETWORK Expert Channel. Be sure to visit today!