This article originally appeared on the BeyeNETWORK.
Data warehouses are a lot of work. Once they are built, they cost money. They need to be monitored. People are constantly requesting changes and additions. The cost of storage quickly adds up.
But the real difficulty of data warehouses comes in the transformation. Older applications have to be merged together when they were never designed to be merged. Definitions of data are confusing. Merging older technology is a real challenge because there simply are not people with skills available that understand the technology of yesterday. There is scant documentation, and where there is documentation, it is out of date.
All in all, data warehouses are quite a mess. They are not easy to build, they are not particularly easy to operate, and they are expensive. So why do we need data warehouses? Why don’t we build “sort of” data warehouses? Why don’t we build “virtual” data warehouses?
If we build virtual data warehouses, we can avoid the nasty work of integration that a real data warehouse entails. Furthermore, we can have the virtual data warehouse up and running in a fraction of the time that it would take to build a real data warehouse. This sounds pretty appealing, and a lot of people have fallen for this approach only to find that a virtual data warehouse is a mirage. You are walking in the desert and dying of thirst. You look up to the horizon and see a whole lake of water. Blessed water. If you just keep plodding along, you will reach that water and slake your thirst – but it is only that, a mirage.
What are the problems are with virtual data warehouses? Let’s see why they never work.
When most people talk about a virtual data warehouse, they talk about lashing together several existing databases. Through a query, they make simultaneous queries of these databases. The result of the query is the same as if there had been a real data warehouse. Sort of. Actually, not really at all.
So what’s wrong with this virtual data warehouse approach? It surely sounds good, if we could just make it work. The problem is that there are a million things fundamentally wrong with the virtual data warehouse approach. These problems include:
- There is a dollar figure in the three databases that are lashed together. The problem is that in one database, the dollars are in U.S. currency. In another database, the dollars are in Canadian currency. The third database holds dollars in Australian currency. You simply cannot just add dollar figures together meaningfully.
- Response time is a mess. One database is being reorganized and will be ready tomorrow. The problem is that a person wants query results back today.
- Another problem is technology. One database is Oracle, another database is UDB, and a third database is MS SQL Server. In order to operate efficiently, these database management systems must be intermeshed tightly. The problem is that Larry Ellison, Bill Gates, and the boys from Armonk don’t sit down and do lunch very often. In fact they are competitors.
- The resources required to do queries is significant. The queries must go out on a network, query a database, bring the data back, stage the data and coordinate with other data that has been returned. Then, the data is available for analysis. Upon doing a second query, the entire process must be repeated. The resources required to accommodate such a query start to add up very quickly.
- There is no historical data to speak of. Each application participating in the virtualization recognizes the need for performance. In doing so, the application jettisons historical data as soon as it can. In one application is a week’s worth of data. In another application, there is only current data. And in yet another application, there is 14 days worth of data. All the virtualization in the world is not going to help when the query needs to look for monthly or even annual data. The historical data that is needed simply does not exist.
- Metadata is a problem. Some databases have scant amounts of metadata; other data bases have a fair amount of metadata. The problem is that the metadata is very different from one database to another. Trying to get a handle on all of the metadata found in the “virtualized” databases is very difficult.
- Consistency of integration is an issue. An analyst recognizes that data from the three databases needs to be integrated. So the analyst accesses the databases and does the integration. The problem is that another analyst also recognizes that integration needs to occur. The second analyst also integrates data. The problem is that the first analyst has integrated the data differently from the second analyst.
- There is no integrity of data. Because update of data is occurring in the databases that are virtualized, a report run at 10:00 a.m. is different from the exact same report run at 2:15 p.m. on the same day. No one really knows what the correct data values are.
And the list goes on. This is merely a starter set of reasons why virtualization simply doesn’t work. Plodding into the desert leads to greater thirst, not an arrival at a lake full of water.
The next time someone tries to sell you virtualization, pull out this list of very real factors and see how the theoretician that is trying to sell you virtualization squirms. And I guarantee you that the person selling you virtualization is a theoretician because from a practical standpoint, virtualization simply doesn’t work.
Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.
Don't overstate the case for machine learning in databases