How much historical data is enough?

There are many factors that impact the amount of historical data a corporation requires.

There is an old maxim about how much historical data the end user wants: two more years than whatever they currently have. If end users have no historical data, they want the past two years. If they have three years of history, they want five years, and so forth. The interesting thing about this old maxim is that it is neither an exaggeration nor an underestimation. It is pretty much correct.

So let’s examine what is behind this desire for an additional two years of historical data. Another way to look at it is that the end user wants two full cycles of data on which to do analysis. In most corporations, the business cycle is measured by the year: there are peak periods in the year and slack periods in the year. Having just one year’s worth of data means the analyst can look at just one cycle of business, and analysts are always concerned that a single cycle’s numbers may carry a bias. Looking at two cycles of data reduces the chances that a freak year has snuck into the analysis. So there is actually a rationale for wanting at least two years’ worth of data.
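The two-cycle idea can be sketched in a few lines. The monthly figures below are hypothetical, purely for illustration: with a second year on hand, the analyst can measure how closely the seasonal pattern repeats and gain some confidence that year one was not a fluke.

```python
from statistics import mean

# Hypothetical monthly sales totals for two business cycles (years).
# With only year_1 available, there is no way to tell whether its
# seasonal peaks are typical or a one-off anomaly.
year_1 = [110, 95, 100, 120, 180, 210, 205, 190, 130, 115, 105, 140]
year_2 = [105, 98, 102, 118, 175, 205, 200, 185, 128, 112, 100, 135]

def relative_gap(cycle_a, cycle_b):
    """Average month-by-month difference between two cycles,
    expressed as a fraction of the first cycle's mean level."""
    diffs = [abs(a - b) for a, b in zip(cycle_a, cycle_b)]
    return mean(diffs) / mean(cycle_a)

# A small gap suggests the seasonal pattern is stable across cycles.
gap = relative_gap(year_1, year_2)
print(f"cycle-to-cycle variation: {gap:.1%}")
```

The threshold for what counts as a "stable" pattern is a judgment call; the point is simply that the comparison is impossible without at least two full cycles of history.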

However, some businesses do not operate on an annual cycle at all. Consider life insurance companies. They study human lifespans, and understanding that life cycle requires 90 to 100 years’ worth of data. Other businesses have different cycles as well: a real estate cycle may be ten years long, an inflationary period may run five or more years, and a period of economic prosperity may last twenty years.

In any case, there are plenty of circumstances where there is no annual cycle. In these cases, having two years' worth of data does not reflect the life cycle of business at all. So there are companies that will want to store a whole lot more data than two years' worth.

A big part of how much historical data a corporation needs depends on the type of user that will be using the data. There are two basic types of users – farmers and explorers. Farmers are those analysts who know what they want. They do the same type of analysis repeatedly. Typically, farmers submit many requests in a day’s time and are satisfied with only a small amount of data. The only thing that changes for a farmer is the actual content of the data that is being analyzed, not the type of analysis. Farmers are very predictable people. They often find little flakes of gold – little nuggets of wisdom. They seldom find nothing.

Explorers are people who don’t know what they want. Explorers are people who think outside the box. They are very unpredictable and they usually look at very large amounts of data. Explorers often find nothing at all. They have the attitude – “I don’t know what I want, but I will know it when I see it.” Explorers may go six months and submit no requests. Then, in the next week, the explorer may submit ten requests. When explorers find something useful, it may be spectacular. Occasionally, explorers find unexpected huge nuggets of wisdom.

Farmers traditionally do not need a great deal of history. One to two years' worth of history usually meets the needs of a farmer (depending of course on the business cycle).

Explorers, on the other hand, need a lot of historical data. Explorers do the kind of processing that occasionally needs to look at a very long stretch of history. Explorers look for patterns in data. Often, there simply is no pattern to be found. On other occasions there is a pattern of interest, but that pattern only becomes apparent over a lengthy period of time. If an organization has a lot of explorers, then a great deal of historical data is needed to satisfy the curiosity of the explorer.

Once the organization has gathered its historical data, it makes sense to monitor the usage of that data periodically. It is absolutely normal for current and very recent data to be used frequently. However, the older the data becomes, the less frequently it is needed. This is true even when there are explorers in the mix of analysts.

When historical data reaches an age where it is accessed very infrequently, it can be removed from the system. By removing rarely used historical data and placing it in a remote part of the environment, performance is enhanced. In addition, the cost of the environment is lowered.
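A minimal sketch of that aging policy, assuming each record carries a last-access date (the field names and the two-year threshold are illustrative assumptions, not a fixed rule):

```python
from datetime import date, timedelta

# Assumed threshold: data untouched for roughly two business cycles
# becomes a candidate for cheaper, remote storage.
ARCHIVE_AFTER = timedelta(days=730)

def partition_by_access(records, today):
    """Split records into (active, archive) tiers based on how
    recently each record was last read."""
    active, archive = [], []
    for rec in records:
        if today - rec["last_access"] > ARCHIVE_AFTER:
            archive.append(rec)   # move to the remote, low-cost tier
        else:
            active.append(rec)    # stays in the high-performance tier
    return active, archive

records = [
    {"id": 1, "last_access": date(2024, 6, 1)},
    {"id": 2, "last_access": date(2020, 3, 15)},
]
active, archive = partition_by_access(records, today=date(2024, 7, 1))
```

In practice the same split is usually done with database partitioning or tiered storage rather than application code, but the decision rule is the same: age of last access, not age of the data itself.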

There are, then, many considerations in deciding how much historical data the organization should plan on keeping.

Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations. Bill can be reached at 303-681-6772.
