Ever since people started to enter data into computers, there has been a desire to analyze that data. Yet the reality has proved problematic. Computers are great at taking tables of numbers and performing calculations on them at great speed. The problem has always been in deciding which numbers to apply those calculations to, and how to approach the data preparation process.
When I started working in technology in the 1980s, I first came across the elegant notion of a data warehouse, a database designed to hold data specifically for analytics purposes. There were good reasons why it was tricky to apply analytical programming directly to the source transaction systems where data was entered. First, the databases of that time were optimized for transaction processing and were not built for mixed workloads. Accessing large swaths of data -- which was what you wanted for analysis -- would slow down the operation of the critical transaction systems.
Challenges to the data preparation process
There were bigger problems, too. Companies had set up their transaction systems as separate data silos, often run by different departments. This meant the information needed for the context of a transaction -- such as which retail outlet it occurred at, which product was involved and which customer purchased it -- was typically stored locally with that system. As systems proliferated, the data about customers, products, locations and assets was duplicated and became inconsistent across the enterprise.
Inconsistent data wasn't a huge problem for each individual system on its own, but it mattered a great deal when you needed a view of data across the enterprise, such as total sales by product or customer. One huge data warehouse project I became involved with in the late 1990s was justified because the new CEO of a global consumer goods company had decided to rationalize the company's sprawling brand portfolio and focus on only the most profitable brands. Once the organization had gone through the data preparation process and analyzed the figures, the CEO was shocked to find that, in that decentralized company, no one knew which brands were actually profitable on a global basis.
Furthermore, the quality of the data itself was often dubious. Human beings typing in customer names, addresses, product codes and prices inevitably made mistakes, despite the best efforts of the designers of the data entry systems to validate at the source. I recall a project in which a subsidiary of a large organization discovered, after a major data clean-up exercise, that once all the duplicate entries were removed it had just one-fifth the number of corporate customers it thought it had. Combining corporate data with external data from third parties only added to the complications.
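The kind of clean-up described above can be sketched in a few lines. This is a hypothetical illustration, not the tooling used on that project: it collapses near-duplicate customer names by normalizing spelling and stripping legal-form suffixes, a crude version of the matching rules a dedicated data quality tool applies with far more sophistication.

```python
import re

# Illustrative only: legal-form suffixes that cause "duplicate" customers.
LEGAL_SUFFIXES = {"inc", "incorporated", "ltd", "limited", "llc",
                  "corp", "corporation", "co"}

def normalize(name: str) -> str:
    # Lowercase, drop punctuation, split into words.
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    # Strip trailing legal-form suffixes ("Inc.", "Ltd", ...).
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

def dedupe(names):
    seen = {}
    for name in names:
        # Keep the first spelling encountered for each normalized key.
        seen.setdefault(normalize(name), name)
    return list(seen.values())

entries = ["Acme Corp.", "ACME Corporation", "Acme, Inc",
           "Globex Ltd", "Globex Limited"]
print(dedupe(entries))  # → ['Acme Corp.', 'Globex Ltd']
```

Five raw entries collapse to two customers; scale that ratio up and it is easy to see how an organization could overcount its customer base fivefold.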
Data warehouses were built by IT departments to solve issues related to the data preparation process. These took feeds from multiple source systems, including ERP systems, and often applied data quality tools to clean up the data loaded into the warehouse. The idea was to construct a reliable and authoritative set of data that could be used to satisfy the analytics and reporting needs of the company. Such projects were typically large-scale and ambitious, and some that I worked on at various companies were in the hundred-million-dollar range. Some of the data warehouses worked well; many did not, stumbling across the difficulty of maintaining the integrity of the data as the organizations went through changes, mergers and acquisitions.
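The warehouse pattern just described (extract feeds from several silo systems, apply quality rules, load one authoritative table) can be sketched minimally. The source feeds and the quality rule below are invented for illustration; real warehouse loads used dedicated ETL and data quality tooling at vastly larger scale.

```python
import sqlite3

def extract():
    # Stand-ins for feeds from separate transaction systems,
    # with the inconsistent casing and missing values silos produce.
    sales_eu = [("widget", "DE", 120.0), ("widget", "fr", 80.0)]
    sales_us = [("WIDGET", "US", 200.0), ("gadget", "US", None)]
    return sales_eu + sales_us

def transform(rows):
    cleaned = []
    for product, country, amount in rows:
        if amount is None:          # quality rule: reject incomplete rows
            continue
        # Conform codes so cross-system aggregation is possible.
        cleaned.append((product.lower(), country.upper(), amount))
    return cleaned

def load(rows):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (product TEXT, country TEXT, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return db

db = load(transform(extract()))
totals = db.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product").fetchall()
print(totals)  # → [('widget', 400.0)]
```

The point of the pattern is the final query: once the feeds are conformed into one table, a question like "total sales by product" has a single, defensible answer.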
Traditional databases ran well once they were set up, but changing their schema structure on a running system was a major exercise. Adapting the data feeds into the data warehouse to accommodate a new corporate acquisition might take months, during which time the warehouse data was not fully up to date. If there were enough such changes, as could happen with acquisitive corporations, the warehouse team soon resembled a cartoon cat chasing its own tail, unable to keep the warehouse current.
At this point, business analysts lost faith and would start to download their own copies of data to spreadsheets for reports. Once that happened, it was a vicious circle as the data warehouse became less and less trusted and the analysts spent more and more time maintaining their shadow copies of data outside the reach of the IT department. Needless to say, such shadow copies did not stay up to date, nor were data quality tools applied to check the accuracy of this data. Such analysts regarded themselves as data freedom fighters, but were regarded by the central IT department as data terrorists.
The current state of data preparation
This is pretty much the stage that many organizations have reached today, as can be seen from the emergence of tools devoted to the data preparation process. A range of tools, some from new vendors and some from existing ones, has emerged to let business analysts prepare their own data for analytical purposes from different internal and external sources. This market had reached $1.78 billion by 2017 and is growing rapidly. Every such tool sold is an indictment of the state of data in corporate data warehouses, because if the latter were working properly, there would be no need for such tools. The rise of big data from nontraditional sources like Hadoop files, in volumes that traditional databases struggle to store and process, has been a further nail in the data warehouse coffin.
In large companies, though, systems are like old soldiers: They never die -- they just fade away. The corporate data warehouses are still there and are being supplemented by Hadoop data lakes and other big data systems, with teams of puzzled analysts struggling to make sense of it all. There are ever more clever reporting and analytics tools to produce impressive charts, but who is to say whether the data that underlies those attractive graphics is truly correct?
For most companies, the original dream of a single, authoritative copy of data for an enterprise has broken down under the sheer weight of expectations that have been placed on it. The rise of the data preparation tools market is proof of that: The technology offers a treatment for a symptom of a fundamental underlying data disease in large organizations.