This article originally appeared on the BeyeNETWORK.
First there were applications. Then data warehouses and data marts were developed. Next came the exploration warehouse and the operational data store (ODS). Finally, an architecture known as the corporate information factory (CIF) evolved.
In fact, nearly all vendors and most IT organizations today use the Corporate Information Factory as a basis for modern information systems architecture. The Corporate Information Factory covers all the bases.
When one looks at the Corporate Information Factory, often the first reaction is, “Wow, look at all of the data redundancy!” Indeed, as data flows from left to right in the Corporate Information Factory, there is an apparent repetition of the same data across the architecture.
But we really need to discuss what is meant by “redundancy.” A careful definition of redundancy is necessary, or else we will make some very questionable inferences. The strict definition of redundancy is the literal copying of data. A looser definition also counts data as redundant when it has been changed or reformatted from one form to another.
As shall be seen, with the strict definition of redundancy, there is hardly any redundancy of data in the Corporate Information Factory. It is only where a very loose definition of redundancy is applied that the Corporate Information Factory contains any amount of redundant data at all.
So, what about redundancy of data in the Corporate Information Factory? Let’s examine a banking transaction. In this banking transaction we have:
- To whom the money is to be paid
- What bank the draft is written on
There will be other pieces of information relating to the banking transaction that are captured as well.
The transaction flows into the operational account system of record. At this point, the amount of the transaction is subtracted from the current balance. Suppose the current balance is $5,000 and the transaction amount is $450; the current balance then becomes $4,550. Is there any redundancy here? No, the new balance is new information.
Next, the transaction is added to the monthly statement register. The only fields that are added are date, amount and check number. Is there redundancy here? There may be some.
The transaction then flows into the branch account daily register. Again, only a few data fields are stripped off to be added to the branch bank’s information. Is there redundancy of data here? Probably there will be some.
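As a rough illustration of the steps so far, the sketch below posts one check to an account and then builds the two register entries. The class, function and field names are hypothetical, and the fields kept by the branch register are an assumption, since the article only says that a few fields are stripped off.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CheckTransaction:
    # Field names are illustrative assumptions, not a real banking schema.
    check_number: int
    txn_date: date
    amount: int  # whole dollars, to keep the arithmetic exact
    payee: str
    drawn_on_bank: str
    branch: str


def post_to_account(balance: int, txn: CheckTransaction) -> int:
    """System of record: the new balance is derived, not copied."""
    return balance - txn.amount


def statement_entry(txn: CheckTransaction) -> dict:
    """Monthly statement register: keeps only date, amount and check number."""
    return {
        "txn_date": txn.txn_date,
        "amount": txn.amount,
        "check_number": txn.check_number,
    }


def branch_register_entry(txn: CheckTransaction) -> dict:
    """Branch daily register: assumed here to keep branch, date and amount."""
    return {"branch": txn.branch, "txn_date": txn.txn_date, "amount": txn.amount}


txn = CheckTransaction(
    check_number=1001,
    txn_date=date(2005, 3, 14),
    amount=450,
    payee="Example Payee",
    drawn_on_bank="Example Bank",
    branch="Main Street",
)

new_balance = post_to_account(5000, txn)
# Fields that appear in both registers are the only candidates for redundancy.
shared_fields = statement_entry(txn).keys() & branch_register_entry(txn).keys()

print(new_balance)            # 4550
print(sorted(shared_fields))  # ['amount', 'txn_date']
```

Under these assumptions, the balance is a derived value, and only a small subset of fields is repeated across the registers.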
Next, the data is sent to a data warehouse. The entirety of the transaction is captured and stored. Is there any redundancy here? The answer is that for the current year’s transaction data, yes there is redundancy. But as each year passes and the current year’s data is stripped from the operational system of record, there is no overlap between the data warehouse and the operational environment. The data warehouse contains the historical record and the operational environment contains the current year. There is a minimum of overlap.
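The overlap between the two environments can be pictured with a small sketch. It assumes each store is a mapping from a hypothetical transaction ID to the year the transaction was posted; the function names and sample data are made up for illustration.

```python
def overlapping_ids(operational: dict, warehouse: dict) -> set:
    """Transaction IDs held in both stores at the same time."""
    return set(operational.keys() & warehouse.keys())


def strip_year(operational: dict, year: int) -> dict:
    """Drop one year's transactions from the operational system of record."""
    return {txn_id: y for txn_id, y in operational.items() if y != year}


# Maps a hypothetical transaction ID to the year it was posted.
warehouse = {"t1": 2003, "t2": 2004, "t3": 2005, "t4": 2005}
operational = {"t3": 2005, "t4": 2005}  # current year only

before = overlapping_ids(operational, warehouse)
after = overlapping_ids(strip_year(operational, 2005), warehouse)

print(sorted(before))  # ['t3', 't4']
print(sorted(after))   # []
```

Only the current year's rows sit in both places, and that overlap disappears once the year is stripped from the operational side.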
Let’s examine data sent to the data mart environment. Here each detailed transaction from the data warehouse environment is aggregated and summarized with thousands of other detailed transactions. Is this data redundant? It is redundant only in the sense that a summary value may be construed as an aggregate of its parts. The USAGE of the data is significantly different, and that is why I would say the data is not redundant. In each instance the data is formatted for a very different purpose and cannot be used for any other purpose. Is the data the same? In some cases, yes, but it is not in the same format and is therefore significantly different.
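A minimal sketch of that summarization step, assuming detailed rows are simple (branch, amount) pairs and the data mart keeps one total per branch; the grouping key and sample values are hypothetical.

```python
from collections import defaultdict


def summarize_by_branch(detail_rows):
    """Collapse detailed (branch, amount) rows into one total per branch."""
    totals = defaultdict(int)
    for branch, amount in detail_rows:
        totals[branch] += amount
    return dict(totals)


detail_rows = [("north", 450), ("north", 50), ("south", 200)]
summary = summarize_by_branch(detail_rows)

print(summary)  # {'north': 500, 'south': 200}
```

The summary holds fewer rows than the detail and cannot be used to recover the individual transactions, which is the sense in which the data mart is formatted for a different purpose.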
The detailed data warehouse data is sent to the exploration data warehouse environment. There is a massive copying of data. In fact, the data that resides in an exploration warehouse is purely redundant with the data residing in the data warehouse. But, for the most part, exploration data warehouses are project-based exercises. This means that exploration warehouses are finite in duration. Unlike a data warehouse, they do not live forever. So, it is true that data is redundant between a data warehouse and an exploration warehouse. But, because exploration warehouses have a finite life, that redundancy is not permanent.
Finally, one day, data is sent to the near-line or archival environment. As it is sent to those environments, it is removed from the data warehouse environment. There is no overlap between those environments.
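The difference between moving and copying can be sketched as follows. This assumes, purely for illustration, that rows are keyed by a hypothetical transaction ID and that a cutoff year decides what is archived; the article does not specify the archiving rule.

```python
def move_to_archive(warehouse: dict, archive: dict, cutoff_year: int) -> None:
    """Move rows older than cutoff_year out of the warehouse, in place."""
    old_ids = [txn_id for txn_id, year in warehouse.items() if year < cutoff_year]
    for txn_id in old_ids:
        # pop() removes the row from the warehouse as it is archived.
        archive[txn_id] = warehouse.pop(txn_id)


warehouse = {"t1": 1998, "t2": 1999, "t3": 2004}
archive = {}
move_to_archive(warehouse, archive, cutoff_year=2000)

print(sorted(archive))    # ['t1', 't2']
print(sorted(warehouse))  # ['t3']
print(warehouse.keys() & archive.keys())  # set()
```

Because each row is removed as it is archived, no transaction ID ends up in both stores.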
There is indeed a flow of data between the different components of the Corporate Information Factory, and there is some minor degree of redundancy of data, especially in the operational environment. But elsewhere the Corporate Information Factory is practically redundancy-free, something that is not at all obvious when you first look at the architecture.
About the author:
Bill Inmon is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.