
# Who is going to use the data?

## Bigger isn't always better! The value of a data warehouse is determined by real usage, not volume.

This article originally appeared on the BeyeNETWORK.

The other day I visited a data warehouse site to discuss various matters of interest with the customer. The customer was proud that its warehouse held over twenty terabytes of data and was growing at a rate of 70 gigabytes a week.

While the company thought of this as a technological masterpiece, it triggered a thought in my head: who is going to use all this data?

Now that is an interesting question in organizations that have very large data warehouses. Of course, vendors will defend these large data warehouses; they want to sell you more equipment. But at some point, don't you stop needing data?

If you want to find out how much data you need, try this analysis. Let A be the number of analysts using your data warehouse, B the number of queries each runs per day, C the average number of rows each query consumes, and D the average row length in the data warehouse. Using these numbers in a simplistic manner, the amount of data the organization needs in a year's time can be calculated as follows:

365 days x A x B x C x D

But this formula is simplistic in that it assumes no row of data is ever looked at more than once, which is of course unrealistic. So we introduce another variable, E, the reuse factor. If E = 1, there is no reuse of data. If E = 2, then on average each unit of data is looked at twice in a year's time; if E = 3, three times a year; and so forth.

Now the usage equation for how much data is actually used in a year's time becomes:

(365 days x A x B x C x D) / E
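The arithmetic above can be sketched in a few lines of code. All of the input figures below are hypothetical, chosen only to illustrate the calculation:

```python
# Estimate of the distinct data actually used in a year, per the
# formula above: (365 x A x B x C x D) / E.

DAYS_PER_YEAR = 365

def data_used_per_year(analysts, queries_per_day, rows_per_query,
                       avg_row_bytes, reuse_factor=1.0):
    """Return the estimated bytes of distinct data consumed in one year.

    reuse_factor is E: the average number of times each unit of data
    is looked at per year. E = 1 means no reuse at all.
    """
    total_read = (DAYS_PER_YEAR * analysts * queries_per_day
                  * rows_per_query * avg_row_bytes)
    return total_read / reuse_factor

# Hypothetical shop: 20 analysts, 10 queries a day each, 5,000 rows
# per query, 500-byte rows, each row reused about 3 times a year.
used = data_used_per_year(20, 10, 5_000, 500, reuse_factor=3)
print(f"{used / 1e9:.0f} GB of distinct data used per year")
```

Even with fairly busy analysts, the result lands in the tens of gigabytes, which is the point of the exercise: compare this figure against the terabytes actually provisioned.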

When you apply this simplistic formula to your organization, it is not unusual to find that the data actually used is small, much smaller than the volume of storage you have. In one case an organization had 4.5 terabytes of data and was looking at no more than 100 gigabytes of data in a year's time. In other words, it was looking at roughly 2% of its data.

This analysis is simplistic because it assumes a certain regularity of processing among the users. Typically these users are called "farmers," and farmers make up the bulk of most organizations' DSS processing. But what if you don't have farmers? What if you have explorers? With explorers there is a completely different pattern of access and processing, and you should expect the ratio of data used to data stored to increase. For one thing, explorers operate from the most granular data; for another, explorers do not have repeatable patterns of access.

Another mitigating factor in the volume of data that is used is the granularity of the data. In general, more granular data is better. But there is such a thing as too fine a level of granularity. Consider, for example, the clickstream data generated by the web environment. Clickstream data is generated at such a low level of granularity that it is not directly useful. Only after clickstream data is filtered, edited and aggregated does it become useful.

It is often useful in the data warehouse to store some small amount of summary data. As a simple example of summary data that fits inside a data warehouse, consider the figures from the quarterly statement prepared by the accounting firm for each publicly traded company. There is no way that the corporate expenses, revenue, profits, etc. are going to be recalculated every time someone needs to see these numbers. Organizations that do the bulk of their processing at the detailed level have a higher ratio of used data. Organizations that take advantage of small amounts of summarization in their data warehouse have a lower ratio of used data.

But of course the largest mitigating factor in how much data you are actually using is the amount of history kept in the data warehouse. It is axiomatic that the more history is kept, the more data volume you have. At the same time, more history translates into less data used as a percentage of the total volume residing in the data warehouse.

The actual volume of data in the data warehouse that is used is, by itself, a spurious measurement. What really counts is the volume of data that MIGHT be used. For example, suppose an organization uses 500 gigabytes of data, but another 250 gigabytes is needed, not because it IS used but because it MIGHT be used. This is a valid concern to most organizations.

However, if you are using 500 gigabytes, and might use another 500 gigabytes, but have ten terabytes of data in place, then your waste factor is still very high.
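The waste factor described above can be expressed as a one-line ratio. The figures here repeat the hypothetical example in the text (500 GB used, another 500 GB that might be used, 10 terabytes in place):

```python
# "Waste factor": the fraction of stored data that is neither used
# nor plausibly needed. Figures are hypothetical, from the example
# in the text.

def waste_factor(stored_gb, used_gb, might_use_gb):
    """Return the fraction of stored data beyond used + might-be-used."""
    needed = used_gb + might_use_gb
    return max(0.0, (stored_gb - needed) / stored_gb)

# 10 TB in place, 500 GB used, another 500 GB that might be used:
print(f"{waste_factor(10_000, 500, 500):.0%} of storage is idle")
```

With those numbers, nine-tenths of the warehouse is neither used nor likely to be, which is what "your waste factor is still very high" means in concrete terms.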

Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations. Bill can be reached at 303-681-6772.

Editor's Note: More articles, resources and events are available in Bill's BeyeNETWORK Expert Channel. Be sure to visit today!
