This article originally appeared on the BeyeNETWORK.
Most people think short term. That is natural because, for most of us, the events and demands that we face are short-lived. However, occasionally it makes sense to look at the long term. In some cases, the long term unfolds in front of us so slowly that it is not apparent at all what is going on. This is true of geologic activity, such as the movement of continental plates. It is also true of the life cycle of data.
Consider that in the early days of our industry, we were concerned with systems that created or generated data. Typical of these were transaction-based systems where data was captured or generated as a byproduct of the execution of a transaction. Many systems – indeed many technologies – grew up around these very immediate, very current data systems. We had online transaction processing. We had reports. We had updates and the like.
And then these systems grew so large and had such stringent processing performance requirements that we had online databases and entire online environments. Very current data was the order of the day.
Next came the desire for integration across these many systems. And as data began to age in these systems, there came data warehousing. Data in a data warehouse was older than data in an online database. There was a lot more data in the data warehouse, for no reason other than that the data warehouse held data for a long time.
However, over time, the probability of access for data in the data warehouse diminished, usually as the data aged. Thus, as the probability of access lessened, data was put into near-line storage. There, the data could be safely kept for a long period of time at little cost. But even this data aged. At some point, for data in near-line storage, the probability of access started to approach zero. It is at this point that the corporation has to make a decision – do we throw the data away, or do we keep it in an archival environment? Usually, the data is kept in an archival environment because it is felt that if the data is already captured automatically, then to reconstruct the data at a later point in time would be expensive – if, in fact, the data could be reconstructed at all.
Data then goes through several predictable phases in its lifetime –
- Capture, where data is edited and initially placed into an electronic format
- Current access, where data is accessed online
- Collective data warehouse access, where data is integrated over numerous applications
- Near-line storage access
- Archival access
These phases – or at least the early phases – can be seen in a bank transaction.
For example, you enter the bank on a Tuesday morning. You ask to withdraw $100. The transaction is captured electronically. Upon capture, the transaction goes into an online database, thus enabling you to view your transaction online. For 30 to 60 days, you can look at the transaction online. Then, after 60 days or so, the details of the transaction are put into a data warehouse. Now, when you go to the bank and want to look at a transaction that took place last year, the bank turns to the data warehouse. Response time is slower, but you do not often go looking for transactions that happened one year ago – unless you are audited by the IRS. Then, you have reason to go looking for data that is five years old. This data is not in a data warehouse. In order to find the details of your account, you need to go to near-line storage. The data is there, but it requires some looking. Thank goodness people aren’t audited by the IRS very frequently.
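The tiering described in the banking example can be sketched as a simple age-based routing rule. This is only an illustration, not any bank's actual policy: the 60-day online window comes from the example above, while the two-year warehouse cutoff and ten-year near-line cutoff are assumed values chosen to be consistent with the "last year" and "five years old" scenarios in the text.

```python
# Illustrative sketch of the data life cycle as an age-based tiering rule.
# Thresholds marked "assumed" are hypothetical, not from the original article.
from datetime import date

def storage_tier(transaction_date: date, today: date) -> str:
    """Return the storage tier where a transaction of this age would live."""
    age_days = (today - transaction_date).days
    if age_days <= 60:                # 30-60 days of online access (from the example)
        return "online database"
    elif age_days <= 2 * 365:         # assumed: ~2 years in the data warehouse
        return "data warehouse"
    elif age_days <= 10 * 365:        # assumed: ~10 years in near-line storage
        return "near-line storage"
    else:                             # probability of access approaches zero
        return "archival environment"

# A transaction from last year would be found in the data warehouse.
print(storage_tier(date(2023, 2, 1), date(2024, 2, 1)))
```

The point of the sketch is that each tier trades access speed for storage cost; the routing decision is driven entirely by the declining probability of access as data ages.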
Continuing the banking example, inevitably, one day you will die. In settling your estate, the bank may need to review a transaction that happened 20 years ago. The executor of the estate submits a request to the bank and the archival records are pulled up. It doesn’t happen very often that such a request is made, and finding the data in the archival environment is a tussle, but the records are produced after a lengthy and complex search.
It is interesting to use the life cycle of data as a basis for examining how we build data warehouses today. How many times have people built a data warehouse that is expected to grow eternally? People build data warehouses as if there were no life cycle of data – as if data always goes into a data warehouse but never comes out.
Some hardware vendors even encourage this kind of thinking. Why? Because the more data the vendor can have the client pull onto their technology, the more money the client is going to be obliged to spend. This vendor attitude does the client no good in the long haul.
The reality is that, at some point in time, data needs to depart the data warehouse.
It is interesting to look at the industry and see how technologies have grown up around the data life cycle. There are many mature technologies for electronically capturing data. There are many mature technologies for online storage and display of data. There are a few technologies in support of data warehousing. There are even fewer technologies in support of near-line storage. And there are only a handful of technologies in support of archival processing. The earlier in the data life cycle that a technology operates, the more mature the technology.