I am looking for guidance on principles and best practices for designing a data mart or data warehouse. Is it a valid data warehouse concept that, because data marts and data warehouses receive data feeds from various sources and are not operational in nature, they should never pass information back to any of the source systems?
If so, that means the data flow could either be:
a) Source --> DM --> DW
b) Source --> DW --> DM
But it would never be:
a) Source <--> DM <--> DW
b) Source <--> DW <--> DM
However, I have seen designs where information is passed back to the source system from a data warehouse or data mart. So I need to understand: Are we breaking a principle if we do that, or is it acceptable?
When I first started building data warehouses, I read in the books that it was important for data to flow from the source systems into the warehouse and never from the warehouse back to the source systems. Fine -- I accepted that this was good practice and followed it. However, one of the things we were doing in the data warehouse was cleansing data: we would often bring data in from multiple source systems, cleanse it and then consolidate it, so that we had, for example, a complete and accurate customer list. This was a great deal of work, and the reason it was so much work was that none of the source systems held a clean, complete list of all customers.
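The cleanse-and-consolidate step described above can be sketched in a few lines. This is a minimal illustration, not a real ETL pipeline: the field names (`source`, `email`, `name`), the sample records and the choice of normalized email address as the matching key are all hypothetical assumptions.

```python
# Hypothetical sketch of cleansing customer records from several source
# systems and consolidating them into one master list. Matching on a
# normalized email is an assumption for illustration; real matching
# logic is usually far more involved.

def normalize(record):
    """Standardize the fields used to match customers across sources."""
    return {
        "email": record["email"].strip().lower(),
        "name": " ".join(record["name"].split()).title(),
        "source": record["source"],
    }

def consolidate(*source_lists):
    """Merge customer lists from several systems, keeping one cleansed
    record per matching key (here, the normalized email address)."""
    master = {}
    for records in source_lists:
        for record in records:
            clean = normalize(record)
            master.setdefault(clean["email"], clean)  # first match wins
    return list(master.values())

# Hypothetical sample data: the same customer appears in two systems
# with inconsistent formatting.
crm = [{"source": "crm", "email": "Ann@Example.com ", "name": "ann  smith"}]
billing = [
    {"source": "billing", "email": "ann@example.com", "name": "Ann Smith"},
    {"source": "billing", "email": "bob@example.com", "name": "Bob Jones"},
]

customers = consolidate(crm, billing)
# customers now holds two cleansed records: Ann Smith and Bob Jones
```

The point of the sketch is simply that the consolidated list is cleaner and more complete than any single source list, which is what makes it a candidate for feeding back.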
After a while, it became apparent that the data warehouse held a far better customer list than any of the source systems did. You can probably see where this is going. We reasoned that if we sent the customer list from the data warehouse back to the source systems, then we could improve the quality of the data in the source systems themselves.
Take this one stage further. Suppose you have five source systems, all of which use customer data. From two of those systems you can create a good-quality, complete, up-to-date customer list. Why should you not create that list and feed it back to all five source systems? Well, we didn't have a good answer to that question, so we did it. And it worked.
Several years later, I began to hear about master data management (MDM) and decided I'd better read up on it and find out what it was. I discovered that we had invented it several years earlier! Of course, we hadn't really invented it; like many other people around the same time, we had simply figured out a sensible process that later became part of MDM.
Am I suggesting this approach to data flow is trivial to do? No. There are many issues that need to be dealt with, not the least of which is the control of database keys. Different source systems tend to use different keys. In a data warehouse, we normally add a surrogate key. How do you ensure that you supply the appropriate key values back to the appropriate source system? Answering that question can be non-trivial indeed. But ultimately, the gains to be had by feeding data back out of data warehouses can far outweigh the development work that's necessary to make it happen. As far as I'm concerned, there is simply no question: If the circumstances warrant doing so, it is perfectly good practice to send data from the warehouse back to your source systems.
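One common way to handle the key-control problem described above is a key cross-reference (a "key map") that records, for each warehouse surrogate key, the natural key each source system uses. The sketch below is a hypothetical illustration of that idea; the system names, keys and structures are invented, and a production version would live in a database table rather than an in-memory dictionary.

```python
# Hypothetical key cross-reference: for each warehouse surrogate key,
# the natural key used by each source system that knows the customer.
key_map = {
    101: {"crm": "C-0042", "billing": "8815"},  # surrogate 101 = Ann
    102: {"billing": "8909"},                   # surrogate 102 = Bob
}

def source_key(surrogate, system):
    """Translate a warehouse surrogate key back into the key a given
    source system expects; None if that system has no such customer."""
    return key_map.get(surrogate, {}).get(system)

print(source_key(101, "crm"))  # C-0042
print(source_key(102, "crm"))  # None -- the CRM does not know Bob yet
```

The `None` case is exactly where the feed-back work gets interesting: a customer the warehouse knows but a given source system does not must be inserted into that system under a newly issued native key, and the key map updated accordingly.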
Related Q&A from Mark Whitehorn