In this excerpt from The Shortcut Guide to Large Scale Data Warehousing and Advanced Analytics, readers will learn...
how the proliferation of data sources, data volumes, users and queries affects enterprise data warehouses. Readers will also learn why many businesses need large enterprise data warehouses and get tips on how to gather and manage information from systems throughout an entire organization.
Chapter 1: Exploiting the Power of Large Enterprise Data Warehouses
For professionals that have been working with databases, data warehouses and reporting applications, the dramatic and inevitable growth of the data that you must manage comes as no surprise. The amount of information that businesses store grows explosively from year to year. In 2007, more than 280 exabytes of data were generated. That’s 10% more than anyone predicted. And enterprises are responsible for storing and securing 85% of that data (source: IDC). To make strategic use of that information is the challenge faced by information architects and database engineers in every segment of the economy.
Read more from this series
Exploiting the power of a large enterprise data warehouse (reading now)
As the amount of data becomes increasingly more difficult to manage, the processes by which we store and retrieve that data changes. And, more significantly, the information that we extract from this data has increasingly more impact on the organizations that we serve. This guide is targeted specifically to address the challenge of large data sets. For our purposes, that means data sets with more than 10TB of data. If you have dealt with these large databases, you know how long it can take to load the data, perform backups, query the data and provide accurate answers in a timely manner to your organization’s key decision makers. You may have a system that works fine at the moment, but you can see the trends and you know that you are nearing the limits of your current architecture. This guide can help you lay a foundation that will allow your data to continue to provide the vital insights into your organization that will help the business thrive.
The Rationale for Large Scale Data Warehouses
To address the issues of large scale data warehouses, it is best to start with the understanding of how they become large. The growth patterns and reasons for their growth will help define the best approaches for dealing with them.
Proliferation of Data Sources
Working with data warehouses, it never ceases to amaze me the number and diversity of systems that are used to provide data to the warehouse. As the data warehouse evolves, it needs data from more systems within the organization. As the organization grows, it adds new departments and new systems to support these departments: new order entry systems, manufacturing control systems, inventory monitoring systems, ERP systems, CRM systems -- the list seems endless.
Take, for example, placing an order with a company. Once, it would have been a single point of entry (for example, a green-screen terminal) into a single order entry system. Orders coming from different sources were likely contained on paper until they reached an order entry clerk who actually typed the order into the system. By today’s standards, this approach seems ridiculous. Now the order comes in through a salesperson using his or her laptop. Or a telesales person who is sitting at a workstation while placing calls. Or through an Internet order placed directly by a customer. Or through an EDI document exchange with a business partner. All of these systems do the same thing, but none of them do it in the same way. And although the end result, an order, is (hopefully) placed in a single system that handles all order entry for the enterprise, each system has details that are significant to the patterns contained in the sales process. Thus, they each become an independent source of data to the data warehouse.
The effect is that the warehouse must handle a large number of data sources. This reality places interesting requirements on the extract, transform and load (ETL) systems and the ability of the warehouse to add new data. Many of these systems have small maintenance windows during which the data is static and can be extracted. The maintenance windows occur when the systems are at their lowest usage. Most of these systems will be at their maintenance windows at the same time, creating a bottleneck for grabbing the data and getting it into the data warehouse. Although each source system may be able to divulge its treasure of knowledge in a short period of time, the data warehouse must absorb the data from all these sources quickly and transform the data to organize it for daily reporting. The ETL process and data loading thus becomes a key factor in the design.
The volume of data entering the data warehouse is another key component. Data warehouses are tackling much larger problems and larger data sets than ever before, working with telecommunications companies, credit card and payment processing organizations, financial institutions that analyze national and international stock transactions, and others; the volume of data is simply staggering. The growth of computer capacity has allowed systems to be developed to process terabytes of data in the course of hours. This functionality allows these large sets of data to be seen in their entirety and opens doors to new types of analyses.
I did work for a telecom that was receiving so much data on a daily basis that the data warehouse was taking more than a day to add the data to the system. It was a losing battle and required a redesign of the data structure and the process for adding data to the fact and dimension tables in order to keep up with the loading process. As the business grew, the shear amount of data grew and made the initial design inadequate.
Note: I have found that data sets increase in size geometrically rather than linearly. It is much more common to underestimate the capacity requirements for a data warehouse than overestimate them.
Thus, the second requirement of a large data warehouse is its ability to handle the growing amount of data it must absorb. A well-designed and -implemented system will balance the cost (in hardware and software) of handling the load at hand and have a clearly articulated plan for expanding as the demand grows. This process will optimize the efficacy of the system and keep the costs of the system in control.
Users and Queries
The next challenge is one of users and queries. It centers on several key factors: the number of simultaneous users, the nature and diversity of the queries themselves, and the total volume of the data being queried.
Early in the development of data warehouses, the information within the warehouse was accessed only by a small number of analysts who would carefully query the data and prepare reports for the users. The reports were distributed as static information to the various interested parties. Technology, computing power and the need for specific tactical data have increased; users expect more interactive access to the data stored in the data warehouse. Self-service business intelligence (BI) and user-driven queries demand that the data become more accessible. A wider range of users access the data, and their queries are more diverse. Different uses of the data put different demands on the data. The data is more difficult to optimize because the queries are more diverse.
Note: The growing demand for self-service BI and the need to provide data to much later numbers of users puts greater demands on the query engines of the data warehouse. System scaling can be difficult if the demands placed on the query engines cannot be accurately projected. Systems that can scale to meet these needs are critical.
While this increase in demand grows, the size of the data itself grows each day. The tables and indexes become larger. The time needed to sort through terabyte-size fact tables grows with each passing week. Although there are many techniques for optimizing the queries, the changing needs and complexity of the user community makes optimization for everyone’s needs increasingly challenging. Many of us have seen queries that took 5 seconds to return when the system was initially loaded grow to 5 minutes, 10 minutes or more. There is only so much cache, so many indexes and so many resources to distribute within the system to meet the demand.
The next requirement for a large scale data warehouse is ability to scale. The system needs to handle a growing number of users, a wide diversity of queries and the ability to scale up to tables that contain an ever-growing number of rows. Well-tuned database systems are typically disk I/O-bound. Thus, systems that can expand across multiple disk subsystems while maintaining the integrity of the entire data set are well suited to meet this challenge. The system needs to scale out the bottlenecks of disk I/O, memory, processing power and network bandwidth to meet the needs of the business.
Many of us have received calls from a credit card company asking whether or not we made a purchase. These calls protect us from illegitimate charges on our credit accounts and help protect us against identity theft. To accomplish this, the credit card company must analyze millions of transactions and find the buying patterns of their customers. They use these patterns to detect outlier transactions, those that fall outside the pattern of regular usage. To accomplish this, the analytic system must continually process and update mining models against the current data. The models locate these outliers. It takes a lot of processing power, and it must be run quickly in order to be effective.
Financial institutions track individual transactions within the markets. Millions of transactions are processed and used to populate models that predict the trends within the market. If a financial firm can ride the crest of these trends, they can make a great deal of money. If they cannot keep up with the deluge of information, they will help fund the firms who can.
Note: Although most data warehouse design is based on the use of historical data, it never ceases to amaze me how quickly business owners want information available. A system that can process data and make it available rapidly will help satisfy this demand.
Analytics require the aggregation of data. The aggregate data can be used to show trends over time. The detailed source data can be mined using advanced algorithms to predict trends or reveal correlations in data that provide profound insights in the activities of the organization. A large data warehouse needs to support these aggregations and data mining processes. Although many can occur cyclically (weekly or monthly), there is an increasing need to perform the analysis in real time, or near real time. Enterprises that can obtain this data more quickly have a distinct advantage over their competitors. Thus, large scale data warehouses often need the ability to scale the aggregation and mining of data. They need to be able to add the processing power to running mining algorithms and perform on-the-fly exploration of the data to give the organization this type of key insight in an expedient manner.
Data warehouses grow because the businesses that build them grow. They gain new data sources, the data volume increases, the number of users and diversity of queries expands, and the need for more advanced and timely analytics increases. Thus, a series of data marts that merge into an enterprise data warehouse or an enterprise data warehouse that continues to grow with increasing detail will get larger. As the size increases, the initial architecture often reaches its limits. The wisdom of allowing a data warehouse to outgrow its platform may then be brought into question. Perhaps allowing the data to grow into the terabytes is not the right answer. However, the only means by which an enterprise can clearly see the entire picture of its activities is to view them as a whole and understand how each composite function affects the others. Although it might be unwieldy to handle all this data, the benefit of having it available in a single, queryable format more than makes up for the effort.
Why Do Businesses Need Large Scale Data Warehouses?
Data warehouses and analytical applications exist to serve the business needs of the organizations that build them. The question then becomes, Is there a real need to let a data warehouse grow into the 10+ terabyte range? They are large, costly, and provide their own special technological challenges, so does the business really need large scale data warehouses? After all, no one is going to use all 10+ TB of data at once. What are the business drivers that lead an enterprise to build these warehouses? By understanding the drivers, we can gain further insight into the requirements for these systems and help define the type of system that will best meet the needs of the organization.
Understanding the Entire Organization
Every department in the organization has its own software to help it perform its work. By design and convenience, each element of the organization has silos of information that they use to do their jobs. This arrangement is often necessary because the applications that collect the information and serve the various departments need to be close to those departments and responsive to the specific needs they serve.
The problem is that some of the detailed information in one department is often useful to another department. The data warehouse becomes a common hub where that information can be stored and easily shared. The data warehouse can help abstract the needs for shared information from all the detailed information that is relevant only to the source system. This can help reduce the overwhelming amount of detail to a higher level that can be more easily shared throughout the organization.
The gathering of information from throughout the entire organization can help provide cross-departmental insight. The information that flows between elements within the enterprise can provide a vantage point that cannot be seen from within the limited view of a single department. I worked on a system that had distributed manufacturing with captive partners and independent vendors located throughout the world. Information on the cost to ship, the availability of manufacturing capacity, the location of raw materials and the current contractual arrangements with the potential sources all needed to be considered before a decision could be made where to manufacture an order. No single system captured enough information to feed the parameters of the decision algorithm, but all the information required was stored in the data warehouse. Although ERP systems often contain much of this information, they frequently do not have everything that is necessary. A data warehouse that can contain all these cross-departmental data sources and keep a longer view of the data will provide unique insight into the operation. As the data in these warehouses grows, there is a need for a larger scale warehouse and analytical system to manage it.
Note: Gathering information from systems throughout the organization is often what leads to data warehouses growing in size. It can be difficult to predict the size from all the data sources that will prove useful (assuming you can even determine them all in advance). Finding a system that can scale out to meet the demand allows the system to grow to meet this need.
Most really large data warehouses, however, grow from really large data sets. Large organizations that handle millions of transactions per day simply have large data sets. As data collection systems have become more automated and sophisticated, they collect more useful information with each transaction. That creates larger and larger data sets. Combined with increases in the number of transactions, it makes the data grow. If the organization wants to continue its growth, it needs to take full advantage of all the information at its disposal. That means handling larger data sets. Consider, as an example, the advances in inventory tracking. Using RFID, automated tracking of inventory and products follows each movement of those products throughout the distribution chain. That tracking can help keep tabs on inventory and optimize its movement -- if an enterprise has the capability of capturing and analyzing all the additional data related to that movement.
Sometimes, data grows quickly overnight. When two organizations merge, they have a great deal of data that needs to be combined to report on the new organization. Each of the constituent organizations may have a data warehouse. But most data warehouses will struggle to double overnight. The effort to normalize data and charts of accounts from multiple organizations often puts demands for extra metadata and requires additional ETL processing. The work required of the combined data warehouse is often greater than the sum of the work required from the individual data warehouses from which it is derived.
I helped work on a data warehouse for an organization that grew through merger and acquisition. The warehouse collected data from more than 50 distinct ERP systems. The enterprise was really a federation of businesses working across the globe. The only realistic means of seeing the organization as a whole was to aggregate the data into a central repository that could collate all the data from the various organizations into a single cohesive source. Although each individual system could meet the reporting needs of their individual organizations, only a large scale data warehouse could handle the aggregation of this data into a single entity and show a picture of the performance of the enterprise as a whole.
To rationalize information from multiple systems, be they departmental or divisional, the data needs to be collated and organized. Even if it is later reduced into smaller data marts that are focused to direct specific questions, putting all the data in one place at one time allows the entire picture to be painted on a single canvas. For many organizations, that is a large amount of data. And that requires a system that can handle that volume of data and grow with the business as it grows.
Note: Creating unified semantics for the data that can be disseminated throughout the enterprise is one of the most challenging hurdles for making use of information. Fragmenting the data across disjointed data marts can make it even more difficult. Starting with a single coherent data source and then extracting smaller units of data makes this much easier and more consistent.
More about this book and others like it...
- Intrigued by this chapter excerpt? Download a free PDF of the entire chapter: Exploiting the power of large enterprise data warehouses
- To purchase the book or similar titles, visit the Realtime NEXUS website.