This article originally appeared on the BeyeNETWORK.
The other day I visited a data warehouse site to discuss various matters of interest with the customer. The customer was proud that its warehouse held over twenty terabytes of data and was growing at a rate of 70 gigabytes a week.
While the company thought of this as a technological masterpiece, it triggered a thought in my head – who is going to use all this data?
Now that is an interesting question in organizations that have very large data warehouses. Of course vendors will defend these large data warehouses because they want to sell you more equipment. But at some point, don't you stop needing more data?
If you want to find out how much data you need, perform this analysis. Let A be the number of analysts who are using your data warehouse. Let B be the number of queries they do per day. Let C be the average number of rows each query consumes. Let D be the average row length in the data warehouse. Using these numbers in a simplistic manner, the amount of data the organization needs in a year's time can be calculated as follows:
365 days x A x B x C x D
But this formula is simplistic in that it assumes that no row of data will ever be looked at more than once, which is of course very unrealistic. So now we introduce another variable – E – the reuse factor. If E = 1, there is no reuse of data. If E = 2, then on average each unit of data is looked at twice in a year's time. If E = 3, each unit of data is looked at three times a year, and so forth.
Now the usage equation for how much data is actually used in a year’s time becomes –
(365 days x A x B x C x D) / E
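The formula above can be sketched in a few lines of code. The variable values here are invented purely for illustration; plug in your own organization's numbers.

```python
DAYS_PER_YEAR = 365

def annual_data_used(analysts, queries_per_day, rows_per_query,
                     avg_row_bytes, reuse_factor):
    """Estimate the bytes of warehouse data actually touched in a year:
    (365 x A x B x C x D) / E, per the formula above."""
    raw = (DAYS_PER_YEAR * analysts * queries_per_day
           * rows_per_query * avg_row_bytes)
    return raw / reuse_factor

# Hypothetical inputs: 50 analysts, 10 queries per day, 5,000 rows per
# query, 200-byte rows, and each row read twice a year on average (E = 2).
used_bytes = annual_data_used(50, 10, 5_000, 200, 2)
print(f"{used_bytes / 1e9:.2f} GB used per year")  # prints "91.25 GB used per year"
```

Even with dozens of busy analysts, the result lands in the tens-of-gigabytes range, which is the point of the exercise.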
When you apply this simplistic formula to your organization, it is not unusual to find that the actual data used is small, much smaller than the volume of storage that you have. In one case an organization had 4.5 terabytes of data and was looking at no more than 100 gigabytes of data in a year's time. In other words, they were looking at roughly 2% of their data.
This analysis is simplistic because it assumes a certain regularity of processing among the users. Typically these users are called “farmers” and these farmers make up the bulk of most organizations’ DSS processing. But what if you don’t have farmers? What if you have explorers? With explorers there is a completely different pattern of access and processing. With explorers you expect the ratio of data being used to increase. For one thing, explorers operate from the most granular data and for another thing explorers do not have repeatable patterns of access.
Another mitigating factor in the volume of data that is used is the granularity of the data. In general, more granular data is better. But there is such a thing as too fine a level of granularity. For example, consider the clickstream data that is generated by the web environment. Clickstream data is generated at such a low level of granularity that it is not useful. Only after clickstream data is filtered, edited and aggregated does it become useful.
It is often useful in the data warehouse to store some small amount of summary data. As a simple example of summary data that fits inside a data warehouse, consider the data that comes from the quarterly statement prepared by the accounting firm for each publicly traded company. There is no way that the corporate expenses, revenue, profits, etc. are going to be recalculated every time someone needs to see these numbers. Organizations that do the bulk of their processing at the detailed level have a higher ratio of used data. Organizations that take advantage of small amounts of summarization in their data warehouse have a lower ratio of used data.
But of course the largest mitigating factor in how much data you are actually using is the amount of history kept in the data warehouse. It is axiomatic that the more history is kept, the more data volume you have. At the same time, the more history is kept, the less data is used as a percentage of the volume that resides in the data warehouse.
The actual volume of data in the data warehouse that is used is a spurious measurement. What really counts is the volume of data that MIGHT be used. For example, suppose that an organization uses 500 gigabytes of data, but needs another 250 gigabytes, not because it IS used but because it MIGHT be used. This is a valid concern to most organizations.
However, if you are using 500 gigabytes, and might use another 500 gigabytes, but have ten terabytes of data in place, then your waste factor is still very high.
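As a rough sketch, the waste factor in that example can be computed directly. The figures are the ones from the paragraph above; nothing here is measured data.

```python
# Figures from the example: 500 GB used, another 500 GB that might be
# used, against ten terabytes of data in place.
used_gb = 500
might_use_gb = 500
in_place_gb = 10_000  # ten terabytes

# Waste = the fraction of in-place storage that is neither used
# nor plausibly needed.
waste_fraction = 1 - (used_gb + might_use_gb) / in_place_gb
print(f"waste factor: {waste_fraction:.0%}")  # prints "waste factor: 90%"
```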
Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations. Bill can be reached at 303-681-6772.
Editor's Note: More articles, resources and events are available in Bill's BeyeNETWORK Expert Channel. Be sure to visit today!