Monitoring the data warehouse

Bill Inmon looks at dormant data in the data warehouse and provides suggestions for addressing data warehouse performance.

A long time ago, people discovered that they needed to monitor their online systems. When performance slowed to...

a crawl, it was the monitor that helped to discover where the bottleneck was and how to remedy the problem. So when it came to monitoring the transaction, high performance environment, the system programmers of the world were sold on monitors.

The world of data warehousing is an entirely different world. High performance is not an issue. Indeed, if performance is deemed to be acceptable, then that is good enough. But there still are things to be monitored in the world of data warehouse analytical processing. The number one thing that the analyst needs to be wary of is dormant data. Dormant data is data that is not being accessed that resides in a data warehouse. In the early days of a data warehouse, there is hardly any dormant data to be found. But as a data warehouse grows older, dormant data starts to collect.

Dormant data in a data warehouse environment is like cholesterol in the bloodstream of the body. With enough cholesterol, the heart has to pump extra hard just to move blood around. In addition, dormant data costs money. It costs in terms of wasted storage, and it costs in terms of processor cycles needed to move data through a system. Dormant data is not good for the health of the data warehouse environment.

How bad can it get when it comes to dormant data? Recently, a monitor was put on a very large data warehouse, and it was seen that .5% of the data in the data warehouse was actually being accessed. Stated differently, 99.5% of the data in the data warehouse was untouched. And the consumer was getting ready to buy more storage! In any case, dormant data has a very negative impact on the performance and the cost of a data warehouse.

How does an organization go about determining what data is active in the data warehouse and what data is not active? The answer is that a monitor for data warehousing is used. A data warehouse monitor is not to be confused with a transaction processing monitor. Though they are both monitors, they are about as different as chalk and cheese. The data warehouse monitor looks at what queries are being given to the system. Based on the queries, the system can determine what data is being looked at. The monitor needs to present a complete picture of the access of data. The access must be analyzed in terms of:

  • What tables are being accessed

  • What rows within a table are being accessed

  • What columns within a row are being accessed

It is not sufficient to merely monitor what tables are being accessed.

After the access patterns are monitored and the pattern and the frequency of access are delineated, the next step is to subtract the accessed data from the tables. Anything not being accessed is considered to be dormant data. In such a manner, a data warehouse monitor can be used to tell what data is and is not dormant in a data warehouse. The issue of dormant data and its management is the single most important issue in the long-term management of a data warehouse.

The question then arises – how does one go about monitoring the activity in a data warehouse? It is easy to say that the DBMS vendor supplies such a monitor. And, indeed, the DBMS vendor does supply a data warehouse monitor. But there is a fly in the ointment. The DBMS vendor states that the monitor should be turned off during peak period processing. The problem is that the very moment when you need the monitor the most is the moment that you cannot have it. Not being able to use a data warehouse monitor during peak period processing is like giving your kids Christmas presents that cannot be opened in December.

So what people who are serious about monitoring a data warehouse do is use a third-party data warehouse monitor. A third-party data warehouse monitor operates on the basis of examining the data flowing up and down the bit stream. In doing so, there is a modicum of machine resources used. You can use a third-party data monitor during peak period processing and performance will not be compromised. Once the dormant data has been identified, it can be placed onto another physical source of storage. Depending on the particulars, the dormant data can be placed on near-line storage or archival storage. By removing the dormant data from the data warehouse, performance is greatly enhanced.

Of course, if there ever is a need to start to look at and use dormant data, the dormant data can be placed back into the data warehouse. Placing dormant data on alternative forms of storage does not permanently mark the data as unusable or undesirable. Instead, if the probability of dormant data ever being accessed becomes elevated, the dormant data can have its status and its physical location changed.

The other day, a gentleman asked me how to tune a data warehouse. I told him he needed to monitor the data warehouse. He looked at me as if I had said that he should fly to the moon. It was just something he had never considered. In my way of thinking, trying to address data warehouse performance without using a data warehouse monitor is like trying to improve your health without a medical checkup. I suppose it can be done; but all things considered, I am going to let someone else try it.

  • Bill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.

Dig Deeper on Corporate performance management software