In this excerpt from The Shortcut Guide to Large Scale Data Warehousing and Advanced Analytics, readers will learn about enterprise data correlation and how to get a single view of the truth through a data warehouse -- plus reasons why that’s often difficult to achieve. Readers will also get advice on why IT departments should consider an enterprise data warehouse, how to address the needs of business users building a warehouse and operational challenges that await those implementing large data warehouse systems.
Enterprise Data Correlation
For the business analysts to get the most value from the data warehouse, they must be able to reliably and effectively query across data throughout the data warehouse. Conformed dimensions and clearly defined relationships between tables within the database allow the enterprise to relate the activities of its divisions and departments together and determine how the constituent entities interact as a whole.
When data comes from multiple systems, there is always the problem of matching records in one system to records in another. The CRM system has a distinct method of identifying customers. The accounts payable system uses another identifier. The shipping system requires its data to be correlated with the order entry system before it can be matched to the accounts payable system.
This has two effects on the data warehouse. The data warehouse must contain all the data required to build the chain from one system to another. The CRM, accounts payable and shipping system tables need to be contained within the data warehouse. There are often mapping tables required to bridge the gaps and connect data from one system to another. The data warehouse will also require the processing bandwidth to correlate the data and process it into the requisite fact tables and dimension tables.
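The chain from one system to another can be sketched in a few lines of Python. This is a minimal illustration only: the identifiers, the table layouts and the function name shipments_by_customer are all hypothetical, and a real warehouse would perform these lookups as joins against mapping tables rather than in-memory dictionaries.

```python
# Hypothetical identifiers and table layouts, for illustration only.
crm_customers = {"CRM-001": "Acme Corp", "CRM-002": "Globex"}

# Mapping table that bridges CRM customer IDs to accounts payable IDs.
crm_to_ap = {"CRM-001": "AP-9001", "CRM-002": "AP-9002"}

# Shipping rows only know the order number; order entry knows the AP account.
orders = {"ORD-10": "AP-9001", "ORD-11": "AP-9002", "ORD-12": "AP-9001"}
shipments = [
    {"shipment_id": "SHP-1", "order_id": "ORD-10"},
    {"shipment_id": "SHP-2", "order_id": "ORD-12"},
    {"shipment_id": "SHP-3", "order_id": "ORD-99"},  # no matching order
]


def shipments_by_customer(crm_customers, crm_to_ap, orders, shipments):
    """Follow the chain shipping -> order entry -> accounts payable -> CRM."""
    # Invert the mapping table so it can be read from the AP side.
    ap_to_crm = {ap_id: crm_id for crm_id, ap_id in crm_to_ap.items()}
    matched, unmatched = {}, []
    for row in shipments:
        ap_id = orders.get(row["order_id"])
        crm_id = ap_to_crm.get(ap_id)
        if crm_id is None:
            # A broken link anywhere in the chain leaves the row uncorrelated.
            unmatched.append(row["shipment_id"])
            continue
        matched.setdefault(crm_customers[crm_id], []).append(row["shipment_id"])
    return matched, unmatched


matched, unmatched = shipments_by_customer(
    crm_customers, crm_to_ap, orders, shipments
)
```

Rows that cannot be traced through every link end up in the unmatched list, which is the kind of gap that the mapping tables and the cleansing work are meant to close.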
When working with source system data, I have often found that it is inaccurate or incomplete. When correlating data across multiple systems, the key fields that correlate data from one system to another must be present. This leads to the need for data cleansing. Data cleansing will require processing power to accomplish and metadata used to determine how to correct errors in the data and to fill in the gaps. Data mining may be used to choose likely substitutes for missing data. Additional lookups might use fuzzy logic to isolate good data. A variety of other techniques, all of which require additional space and computing power, can be used to make the data as usable as possible. Data de-duplication also becomes an issue. Data duplication can occur because of simple misalignments in data that result in erroneous joins in master data that allow a single transaction to be represented as multiple transactions within the framework of systems. Although removing the duplicate rows will ultimately reduce the total size of the data, it requires the data warehouse system to have the capacity to temporarily manage the duplicate rows and the processing power to find and delete the rows. The additional sorting and filtering can place strong demands on the warehouse system.
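As a rough sketch of the cleansing and de-duplication steps described above, the following Python uses the standard library's difflib string similarity as a simple stand-in for the fuzzy lookups. The master list, the 0.8 cutoff, the field names and the helper functions are assumptions made for the example, not a recommended configuration.

```python
from difflib import get_close_matches

# Hypothetical reference list of clean customer names (master data).
MASTER_NAMES = ["Acme Corporation", "Globex Industries", "Initech"]


def cleanse_name(raw_name, master_names=MASTER_NAMES, cutoff=0.8):
    """Return the closest master name, or None when nothing is close enough."""
    candidates = get_close_matches(
        raw_name.strip(), master_names, n=1, cutoff=cutoff
    )
    return candidates[0] if candidates else None


def deduplicate(rows, key_fields):
    """Keep the first row seen for each combination of key fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


raw_rows = [
    {"txn_id": 1, "customer": "Acme Corporaton", "amount": 100.0},   # typo
    {"txn_id": 1, "customer": "Acme Corporation", "amount": 100.0},  # duplicate
    {"txn_id": 2, "customer": "Unknown Vendor", "amount": 55.0},     # no match
]

# Cleanse first, so that rows differing only by a misspelling become identical.
cleaned = [dict(row, customer=cleanse_name(row["customer"])) for row in raw_rows]
unique_rows = deduplicate(cleaned, key_fields=("txn_id", "customer", "amount"))
```

The order matters: the two copies of transaction 1 only become recognizable as duplicates after the misspelled name is corrected, which is why the warehouse has to hold and process the duplicate rows for a time before it can discard them.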
Note: Many source systems do not maintain clean internal data. The key objective in most source systems is handling individual transactions, so they may not work as hard at completeness -- particularly when the information is not strictly required to complete the transaction. Having the data warehouse cleanse and purify the data will speed report development and help provide better information to the business users.
Many organizations obtain external information. Whether it is common market trends, lists of leads from advertisers or other data services, data dumps from other organizations, payment transaction records, Web server access logs and so on, it needs to be stored. Some of these external feeds are large. They need to be carefully archived because they may not be available later. That often means storing them in their original form and making copies of the data as it is blended into the data warehouse. The extra storage and copies all add size and require bandwidth.
Once the data is clean, correlated and presented, the business has a comprehensive source for data. They can compare activities in sales with activities in purchasing and manufacturing. This provides the insight that helps the organization view itself as a cohesive whole.
A Single Source for the Truth
One of the most interesting dilemmas encountered in a business is when two reporting systems provide different answers for the same data. Source systems often exchange data and then modify or transform the data. One of the most difficult tasks in any reporting system is to keep de-normalized data consistent and synchronized.
A central data warehouse is often used to identify the authoritative source of data for a given data element. If multiple source systems have a copy of the data, a single representation of the data stored in the warehouse can serve as the arbiter of the truth. Another related issue is overloaded terms. For instance, three systems might each receive an order: the order entry system, a partner vendor system and the accounts payable system. Each could have a date entered field, each field could be distinct and each data point important. The user community needs to be able to locate the individual data point that they need so that the data is consistent across reports. Most organizations use multiple reporting systems that make different uses of the same data. Some seek detailed, transactional-level data, while other systems aggregate the information. Drawing from a single source helps keep the data consistent across the systems and reduces confusion and reporting errors.
Having a centralized source for information also helps simplify the process of creating new reports. The data is located in one place and correlation of data is already completed, so it is much simpler to build the report. The efficiency of locating the reporting data in a designated repository makes the completion of new reporting projects much faster and more economical.
Of course, the more reporting done from this data source, the more demand is put on the service. The need to cache data and respond quickly to queries that vary makes tuning a different proposition than it is on smaller siloed systems that contain a less comprehensive breadth of data. The efficiencies of providing a single enterprise data source can also create demand for one that is tuned for a wider gamut of data access.
Note: Although having a single source for the truth is obviously beneficial, it can be difficult to achieve. People become territorial about their information. Providing a central warehouse for information that is consistent and performant can help people surrender “their” data -- in exchange for a broader-based view of the organization as a whole and insight into interactions with other departments.
The demand for these reports can also drive how quickly the system needs to make reporting information available. The moving of data from the source systems to the data warehouse, and the ETL and analytical processing required to make the data available, will help define the capacity and performance requirements for the data warehouse.
Centralizing enterprise data can make it much faster and simpler to find data. It helps make the development of new reports simpler and more economical. It can improve communication and reduce errors and confusion.
The business receives many advantages from a single source of data that is consistent and readily available. As the business grows larger and more diverse, the system will inherently become larger and grow into the category of a large data warehouse. Although the business may benefit from a large, single source of enterprise information, should IT consider alternatives that might overcome some of the technical challenges to maintaining a large, single source of business-critical information? Is it practical to keep all this information in a single system?
Why Should IT Consider Large Data Warehouses?
We have established why having a single, large data warehouse with powerful processing capabilities can help the business deal with large data sets and build a single, comprehensive view of the entire enterprise. From an IT perspective, building such a large, high-power system can be complex and costly. But there are distinct advantages to building a large data warehouse.
Addressing the Real Needs of the Business
Businesses want accurate information about the activities of the organization. They want it immediately, and they often do not consider what it takes to produce this information. From an IT perspective, the details of providing such an environment are where the devil lives.
Most high-level analysis of data is done at an aggregate level. A large retailer may not care that Jane Doe purchased a box of paper clips in a store in Buzzard’s Breath, Wyoming. They often care that the sale of paper clips in the Western district is increasing. More likely, they care that office supply sales are starting to increase. To build the tens of thousands of individual transactions into the aggregates that help purchasing realize that they need to increase purchases of office supplies before they run out and customers find somewhere else to buy them, the system needs a robust method of building analytical structures.
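The roll-up from individual transactions to district and category totals can be illustrated with a short Python sketch. The rows, the field names and the roll_up helper are invented for the example; a production system would build these aggregates in the database or the OLAP engine rather than in application code.

```python
from collections import defaultdict

# Hypothetical transaction rows; a real fact table would hold millions.
transactions = [
    {"district": "Western", "category": "Office Supplies",
     "item": "Paper clips", "amount": 2.50},
    {"district": "Western", "category": "Office Supplies",
     "item": "Stapler", "amount": 12.00},
    {"district": "Eastern", "category": "Office Supplies",
     "item": "Paper clips", "amount": 2.50},
    {"district": "Western", "category": "Furniture",
     "item": "Desk", "amount": 240.00},
]


def roll_up(rows, *dimensions):
    """Sum the amount for every combination of the requested dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row["amount"]
    return dict(totals)


# The same detail rows answer questions at different levels of aggregation.
by_district_category = roll_up(transactions, "district", "category")
by_category = roll_up(transactions, "category")
```

Each additional level of aggregation is another pass over the detail rows, which is one reason the processing demand grows with the number of analytical structures the business asks for.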
Note: Sometimes the retailer does care that Jane Doe bought a box of paper clips. Data mining can help use this information to classify Jane, and send her targeted ads to increase her interest in office products. The data warehouse often needs to support both scenarios.
I have been working for the past decade on building these types of analytical structures. Regardless of the technology or product used, it is a challenge to build systems that can take millions of rows of data and convert them into the analytical structures that provide this information. When OLAP cubes are being built, they place a strong demand on the database to provide information rapidly. They work best when the data is stored in an analytic structure, such as a star schema. This often leads to multiple copies of the data: a normalized operational data store and the reporting star schema. This, in turn, leads to more data storage requirements and more need for data warehouse processing power.
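For readers who have not worked with one, a star schema can be sketched with Python's built-in sqlite3 module: one fact table of measurements surrounded by dimension tables that describe them. The table names, column names and sample rows below are assumptions chosen for the illustration, and SQLite is used only because it ships with Python, not because it is suited to a large warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables describe the "who, what, where"; the fact table holds
# the measurements and points at the dimensions by key.
conn.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT,
    category    TEXT
);
CREATE TABLE dim_store (
    store_key INTEGER PRIMARY KEY,
    city      TEXT,
    district  TEXT
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    sale_date   TEXT,
    amount      REAL
);
""")

conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Paper clips", "Office Supplies"), (2, "Stapler", "Office Supplies")],
)
conn.executemany(
    "INSERT INTO dim_store VALUES (?, ?, ?)",
    [(10, "Cheyenne", "Western"), (20, "Albany", "Eastern")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2012-05-01", 2.50),
        (2, 10, "2012-05-01", 12.00),
        (1, 20, "2012-05-02", 2.50),
    ],
)

# A typical analytic query: join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT s.district, p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_store   AS s ON s.store_key   = f.store_key
    GROUP BY s.district, p.category
    ORDER BY s.district, p.category
""").fetchall()
conn.close()
```

The same sales also live in the normalized operational store, so the star schema is a second copy of the data shaped for analysis.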
The business needs access from all its various data systems. The data should be conformed so that information from HR can be connected to information in accounting to information in shipping to information in quality control. The information needs to be detailed enough to point to specific incidents, but then aggregated to provide insight from a higher point of view. The data needs to be collected over time so that trends can be measured.
The data warehouse provides that common point, where detailed and aggregate data can be stored and provided for every part of the organization in a single, easily accessed location. This can help business analysts quickly and cost-effectively develop new reports and visualizations of the information. In addition, it can help key decision makers develop the correct responses to changes within the organization, taking advantage of successes and minimizing missteps.
To support this type of vision, with pervasive reporting and analysis of the entire enterprise, the data warehouse must have the capacity to grow to meet this need. For successful organizations, this will soon mean dealing with terabytes of information. Planning a data warehouse architecture that can scale to meet this need will allow the information systems to keep pace with the expansion of the enterprise.
Developing from a Single Source
Many of the products used to build OLAP cubes can draw from multiple data sources. It is not uncommon for a system to draw from multiple data marts or smaller data warehouses to obtain the data it requires. However, systems inevitably work best when the data is located in a single source. Data located on separate systems is seldom well synchronized, requiring additional ETL to carry out the processing. Drawing data from multiple systems is typically slower than from a single source.
Note: This situation seems counter-intuitive because using multiple source systems would seem to parallelize the operations across multiple servers to improve performance. In practice, such is very rarely the case.
The nature of analytics is that someone will inevitably want to drill through the aggregates and into the base data. They will want to use the aggregates of interest to isolate a specific subset of transactions and then display (or print or extract to Excel) that subset. Without question, if the aggregate data and the detailed data are stored in the same system, this isolation is a simpler proposition.
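When the detail rows sit alongside the aggregates, drilling through amounts to filtering the detail by the coordinates of the aggregate cell of interest. The following Python sketch assumes hypothetical field names and a hypothetical drill_through helper; it shows the idea, not any particular product's drill-through feature.

```python
# Hypothetical detail rows stored in the same system as the aggregates.
detail_rows = [
    {"txn_id": 101, "district": "Western",
     "category": "Office Supplies", "amount": 2.50},
    {"txn_id": 102, "district": "Western",
     "category": "Office Supplies", "amount": 12.00},
    {"txn_id": 103, "district": "Eastern",
     "category": "Office Supplies", "amount": 2.50},
    {"txn_id": 104, "district": "Western",
     "category": "Furniture", "amount": 240.00},
]


def drill_through(rows, **cell):
    """Return the detail rows behind one aggregate cell.

    `cell` names the dimension values of the aggregate, for example
    district="Western", category="Office Supplies".
    """
    return [
        row for row in rows
        if all(row.get(dim) == value for dim, value in cell.items())
    ]


subset = drill_through(
    detail_rows, district="Western", category="Office Supplies"
)
subset_total = sum(row["amount"] for row in subset)
```

Because the subset comes from the same rows that produced the aggregate, its total necessarily agrees with the aggregate figure, a guarantee that is much harder to give when the two live on separate systems.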
A single source of data that stores both aggregate data and detailed transactional data will be the easiest to maintain and operate. That means that building large data warehouses that can store and process those large amounts of data, and that can easily scale with the growth of the data, provides IT with a very potent analytic resource.
I frequently need to scour an enterprise to locate the data that I need. When business requirements are cast, the business users think in terms of the answers that they require -- not the constraints on where that data is located. A report based on a single source of data is much easier to define and produce.
Placing all the data one needs in one place clearly makes it easier to find and use. But it also helps align data in terms of grain. Different source systems have different methods of capturing data over time. Single transactions occur at a specific point in time, such as the moment a product leaves the production line or a shipment leaves the loading dock. But for many significant business events, the process that needs to be tracked takes place over a period of time. The entire sales cycle may take weeks, months, even years. The steps in the cycle are captured in distinct systems -- CRM, order entry, shipping, manufacturing, payment processing and so on. The tricky bit is to get these systems to track from beginning to end so that the activities in the cycle can be related to one another.
The question I am beginning to raise is one of grain. Grain is a broad topic that deals with how much of one quantifiable business event (for example, a sale) should be measured against the factors that drove it (for example, advertising, sales activity, manufacturing, shipping, inventory and so on). If grain can be rationalized and captured within the structure of the database, it provides a very rich depth of relationships between the data sources within the organization.
Building these bridges is often trickier than it sounds at first. The accounting system provides its data once a month when the books close. The manufacturing systems provide information at the end of each day. The HR system provides data on a bi-weekly basis. It is challenging to integrate the information in these systems to allow their output to work together harmoniously.
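One common way to reconcile feeds that arrive at different intervals is to roll the finer-grained data up to the coarsest grain that the comparison needs. The Python sketch below aligns a daily manufacturing feed with a monthly accounting feed; the figures, the field names and the to_monthly helper are made up for the illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily feed from manufacturing: (production date, units built).
daily_units = [
    (date(2012, 3, 30), 120),
    (date(2012, 3, 31), 80),
    (date(2012, 4, 1), 100),
]

# Hypothetical monthly feed from accounting, keyed by (year, month).
monthly_revenue = {(2012, 3): 50000.0, (2012, 4): 20000.0}


def to_monthly(daily_rows):
    """Roll daily rows up to the month grain used by the accounting system."""
    totals = defaultdict(int)
    for day, units in daily_rows:
        totals[(day.year, day.month)] += units
    return dict(totals)


monthly_units = to_monthly(daily_units)

# With both feeds at the same grain, they can be placed side by side.
combined = {
    period: {"units": monthly_units.get(period, 0), "revenue": revenue}
    for period, revenue in monthly_revenue.items()
}
```

Rolling up loses nothing as long as the daily rows are kept; going the other way, from monthly figures down to days, requires an allocation rule that the business has to agree on.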
Note: Although getting the grain right in a data warehouse is difficult, it is mainly difficult in the design phase. Once implemented, it provides a great deal of benefit to those who maintain and enhance the reporting systems moving forward.
Diversity in data from all these source systems presents a significant challenge to the information architecture team. But it is dealing with this diversity of source data and fusing it into a cohesive whole that is the basis of the most dramatic business insights.
Building this broad landscape of data sources is greatly simplified when there is a single location where the data is combined and rationalized. Integrating new data sources into the single corporate warehouse mandates that the grain issues within the data be addressed so that the data can be added. And although it might be difficult to add the data initially, once it is done, the data can be used across the board to divine business insight.
The data warehouse system then must be able to scale to continually add new source data. It will need the processing power to transform and integrate the data into the conformed enterprise data schema. It will also need headroom to manage the metadata (data source, date extracted, transformations and so on) that inevitably accompanies this wealth of data sources. Ideally, the system can be sized to meet immediate needs and grow incrementally to accommodate growth within the business.
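The metadata that accompanies each load (data source, date extracted, transformations) can be pictured as a small record written alongside the data. The following Python is a sketch under assumptions: the LoadRecord class, its fields and the load_with_metadata function are hypothetical names, and a real warehouse would persist these records in its own metadata tables.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LoadRecord:
    """Metadata kept for one load of source data (hypothetical layout)."""
    data_source: str
    extracted_at: datetime
    transformations: list = field(default_factory=list)
    row_count: int = 0


def load_with_metadata(data_source, rows, transforms):
    """Apply each named transformation in order and record what was done."""
    record = LoadRecord(
        data_source=data_source,
        extracted_at=datetime.now(timezone.utc),
    )
    for name, func in transforms:
        rows = [func(row) for row in rows]
        record.transformations.append(name)
    record.row_count = len(rows)
    return rows, record


def strip_whitespace(row):
    """Trim leading and trailing spaces from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def uppercase_region(row):
    """Conform region codes to upper case."""
    return dict(row, region=row["region"].upper())


source_rows = [
    {"order_id": 1, "region": " west "},
    {"order_id": 2, "region": "east"},
]
loaded_rows, record = load_with_metadata(
    "order_entry",
    source_rows,
    [("strip_whitespace", strip_whitespace),
     ("uppercase_region", uppercase_region)],
)
```

Every new source adds records like this one, which is part of the headroom the system needs beyond the business data itself.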
Many of the organizations with which I have dealt start with a departmental data mart. They plan ahead and use conformed dimensions so that additional departmental data marts can be built. They plan a database infrastructure that seems ridiculously too large for their immediate need, foreseeing that the demand on the database will grow and the data mart will evolve into an enterprise data warehouse.
Two things occur during the course of this evolution that disrupt this simple approach. One is that someone in another department starts a similar project. They build their own data warehouse and it starts to grow. Now two (or three or four) systems begin to grow across the enterprise. The second is that most organizations do not accurately estimate how fast the system will grow. They start with an infrastructure that seems ridiculously too large (knowing how difficult it will be to re-scale later) and outgrow it much faster than they ever imagined that they could.
Eventually, these systems need to merge to form a single version of the truth. Political and territorial issues aside, the platform on which they merge must be able to handle the load. Many organizations start with a platform and then they are bound to it. The cost to move to another platform is prohibitive, even though the platform on which they are operating cannot handle the challenges of the workload. Re-platforming the data warehouse is a vast undertaking and should be done as infrequently as possible. Thus, understanding the scaling capabilities of the platform up front will help determine the right platform on which these data warehouses should be merged (or built in the first place).
Note: My experience has shown that data warehouses grow much faster than anyone predicts. It is a difficult decision to build for a point three years from now when the data warehouse will host 15 TB of data when today it barely hosts 1 TB. However, it is often less difficult than trying to re-platform once the data is 15 TB and everyone is complaining about the poor performance.
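To put the 1 TB to 15 TB example in perspective, the implied growth rate can be worked out with a few lines of Python. The sketch assumes steady compound growth, which real warehouses rarely follow exactly, and the function names are invented for the illustration.

```python
def annual_growth_factor(start_tb, end_tb, years):
    """Constant yearly multiplier that takes start_tb to end_tb in `years`."""
    return (end_tb / start_tb) ** (1 / years)


def project_sizes(start_tb, factor, years):
    """Projected size at the end of each year, starting with year 0."""
    return [start_tb * factor ** year for year in range(years + 1)]


# Growing from 1 TB to 15 TB in three years, as in the note above.
factor = annual_growth_factor(1.0, 15.0, 3)
sizes = project_sizes(1.0, factor, 3)

for year, size in enumerate(sizes):
    print(f"Year {year}: {size:.1f} TB")
```

Under that assumption the warehouse roughly multiplies by two and a half every year, so a platform sized with what looks like generous headroom for the first year can be outgrown during the second.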
The cost in terms of licensing, support staff and effort to maintain multiple data warehouses can be high. It is also interesting to note that these data warehouses are often maintained on competing platforms, so some of the cost advantages of purchasing more software licenses and similar hardware from a single vendor are lost. If different platforms are used, then the IT staff must maintain a wider diversity of skills, which can lead to higher staffing costs. Also, the database systems that are economical for hosting data marts will often not scale to the large sizes of enterprisewide data warehouses.
This is not a critique of using data marts to grow into an enterprise data warehouse. It is a quite effective technique that has a great deal of merit and can pay for itself as it grows. Rather, the point is that the selection of the platform on which the data marts are built should be considered carefully. Can the platform absorb multiple data marts? Can it grow to meet demand over time?
This situation creates a distinctive requirement. The data warehouse platform needs to be able to scale out to meet growing demand. The preferred system is one that can provide a high level of performance at the onset of the project and scale out to maintain that level of performance as the workloads continue to evolve and expand.
The need to maintain reporting systems in a consistent and controlled manner has caused many organizations, from governments on down, to implement regulations and strict procedures on how major IT systems are changed and maintained. The reaction is quite understandable (if sometimes poorly implemented). Thus, IT needs to keep a close watch on all of their systems, but an even closer eye on the reporting systems that supply key information to the key decision makers, regulators and stockholders.
When reporting systems are scattered throughout the organization, the task of keeping them well audited and regulated can be stressful. The systems often keep de-normalized versions of the same data. The de-normalized data does not always agree across systems. An authoritative source of data needs to be defined. To get the reports to harmonize, they should all draw their information from the single source of truth. As the source systems are inevitably heterogeneous systems, a common platform for this authoritative data, the data warehouse, is often the most practical solution.
Having a single data warehouse makes the control and auditing of this reporting information a much simpler task. Monitoring the activities and changes in a single system is always preferable to monitoring multiple systems. Because the data warehouse is the source of authoritative data, a change made there is a change made in one controlled place. That change is propagated to all the key reporting systems that use the data, without a wide variety of changes scattered throughout the infrastructure. Tracing data lineage and auditing data collection become much simpler in a single data warehouse scenario.
The data warehouse then will need to house a number of additional structures. It needs to supply space for maintaining metadata for the data imports. It needs to provide auditing for changes made to the system. It may also need to audit data access. This will increase the demands on the system. But a single system that provides this information will be much easier to manage. It will be easier to keep in regulatory and corporate compliance than several independent systems. Just the reduction in effort in tracking changes within the information infrastructure on a single system versus several distinct, independent systems can represent a significant savings in time, cost and stress.
This was first published in May 2012