In this excerpt from The Shortcut Guide to Large Scale Data Warehousing and Advanced Analytics, readers will learn...
about data security issues affecting data warehouses and the potential analytical advantages of an enterprise data warehouse system. Readers will also learn about the importance of incorporating a full range of business data into a data warehouse and get tips on creating new reports, designing common data dictionaries and building a single, cohesive data warehouse platform.
Data Security and Availability
Most people have witnessed the negative effect that data breaches have on a corporation. There are two issues in particular that are important to address in securing data in a warehouse. The data warehouse must be able to control who can access what data within the data warehouse (data entitlement). The other issue is providing business continuity and disaster recovery (data availability).
Read more from this series
Exploiting the power of a large enterprise data warehouse
The analytical advantages of an enterprise data warehouse system (reading now)
In terms of data entitlement, the system needs to grant access to some users while denying access to others. With a large scale data warehouse, there is a wider range of data to protect and typically a larger number of distinct user groups. The system needs to work from end to end -- from the staging tables through the operational data stores and analytical and reporting structures.
There is also the question of sensitive data, such as personally identifiable information (PII), HR data and corporate-sensitive information. This data needs to be more rigorously defended and carefully audited than other types of information and will be mixed into the data structures of the warehouse. It may require encryption to adequately protect.
It may seem dangerous to place all this tempting data in one location where a malefactor may gather it all in one fell swoop. However, security is a volatile, moving target. People come and go. They change job responsibilities. Keeping security set is always a challenge. Setting it on multiple systems is much more challenging.
Note: If it is difficult to secure one source of information, it is much more difficult to secure multiple sources for information. For instance, a user may be able to see information from the labor reporting system but not the hourly wage of those employees from the HR system. Nonetheless, these are both staged in the data warehouse so that labor costs can be analyzed.
A single data warehouse system that can manage security throughout the chain of data custody provides the simplest, most easily managed form of data security. A single security system is more easily maintained and kept updated. A large data warehouse that houses the single version of the truth can be configured to be one of the most secure. PII can be encrypted or hashed so that the privacy of individuals can be protected. Data can also be aggregated so that information concerning any given individual is not visible, yet the information can still be used in analysis.
Data availability becomes critical because once the data warehouse becomes the primary source of reporting data for the enterprise, the enterprise cannot afford to do without it. This will impact several areas. This first affects backup and restoration of data. Large data warehouse systems need backup processes that can operate while maintaining performance for the users. Although a large single system is more challenging to back up than a group of smaller systems, it is much easier to keep the backup data synchronized. It is simpler to manage the backup operation and ensure the data is protected.
Note: Performing backups and restorations of large data sets can be difficult in the narrow time windows typically available. When designing a large data warehouse system, the time required for these operations must be carefully considered and addressed.
The second is availability. A single, cohesive system can become a single point of failure. In most cases, however, a system can be designed that eliminates single points of failure. It is a consideration that must be given careful thought. First, the data warehouse system should protect itself from internal failure. Second, if remote location disaster recovery is appropriate (and for data warehouses of this type, it typically is), the system must allow for operation at a remote site. Synchronizing data with a remote site is a major issue that is exacerbated in large data warehouses by the sheer volume of data. To support a large data warehouse, these issues need to be properly engineered.
The advantage to a single system is once again the ease of management. Keeping multiple systems available 24 × 7 is much more burdensome than a single, well-designed system that has clear processes and procedures for maintaining its availability. From an IT standpoint, the burden of maintaining a single larger system is typically lighter than that of maintaining several smaller systems.
Analytic Advantages of Large Data Warehouses
Analyzing data, from aggregation to data mining, provides some of the most profound insights into the business. Analytics can be used to detect trends and help forecast upcoming events. They can be used to fill in the missing gaps in information. They can help identify causal relationships and provide the organization’s key decision makers with the best data on which to make decisions.
Thus, the data warehouse should provide more than a place to store data. It should help organize and build data structures that extract and refine this information. Large data warehouses can provide a strong foundation for this type of analytical data.
Full Range of Business Data
Finding key trends and information within a data set requires a full range of data. The more thorough and detailed the data is, the more accurate the analytics can be. The implications for this are that keeping more data online and available to the system provides a richer source for the analytic system. It also means keeping more sources of conformed data can provide the analytics a broader reach across the organization.
I have worked with many systems where detail data is archived. With higher-volume data, this is a common practice. Once the data is aggregated, there may not be a need to keep the full range of detailed data. Often, the details are removed and higher-level aggregations are kept because the system simply cannot handle the volume of data. However, if data mining is required, the detail must be retained and available to build and train the mining models. I have worked with systems that when the detailed data from the past year or two years ago is required, it needs to be laboriously restored from archives. Or the detailed data from two years ago is simply not available. The ability of the system to handle large volumes of data can dictate the type of analyses that can be performed.
Note: Consider a system designed for a retailer that needed to forecast trends for certain products that are seasonal and linked to specific promotional activities. Although the sales data for these products was available, it was aggregated at the departmental level. In order for the retailer to provide the specific reporting they desired to plan their sales, they needed to restore detailed information for the last three years. The system did not keep the information online because they were running out of room. Although this system archived the detailed data, many other systems do not, and that would have made the analysis impossible. The design did not fully meet the needs of the organization because the limits of the physical architecture drove their decision-making processes.
Enterprise data can often be affected by external factors. External events like stock market trends, weather and commodity market shifts can have a major effect on the data internal to the business. If the data warehouse has the capacity to add this data, both in terms of storage and processing power to incorporate and then utilize the data, then it can add real business value. Conversely, if the system is already running near capacity, adding such data becomes unrealistic.
The ability to store large amounts of data is only part of the issue. Large data warehouse systems that support analytics need to provide enough processing power to aggregate data and build mining models while still remaining responsive to users. Thus, the scale of these systems to perform concurrent operations such as loading data while querying the data securely is an important consideration. Yet the systems that can meet these challenges provide the greater value to the business.
Many organizations require a rich mix of aggregate data and detailed, transaction-level data. Aggregates often identify a subset of transactions that exemplify the desired behavior, or those that indicate trouble within an organization. A system that can drill down into the details and help the analysts find the successes or the missteps will help the enterprise make better-informed decisions.
This again helps define the requirements of the large data warehouse. It needs not only to handle large data volumes but also must have the capacity to load it quickly and remain responsive to users. It must support detailed transactional data structures, aggregate structures and data mining structures that help analysts gain key insights.
Common Data Dictionaries and Information Schemas
Having a large amount of data is not the key to a successful data warehouse. The data needs to be understood by the people who use it. Each source system will have its own designations for various data elements. An order entry date may be called the sale date on a system, the ED on another system and just Date on a third system. The people who use the source systems will think of the data element in the terms of the system they use most. A common reporting system needs to build the bridge across all these departments and systems and help the users enterprisewide to gain a common understanding.
Data warehouses in general serve this purpose quite well. They allow all the data within the organization to be generalized into a canonical schema. Different inputs from different systems can be used to build this schema. As long as the schema is well-thought-out and conformed across the data within the organization, it will serve as the central repository and help people report.
Although it needs to be consistent, the information also needs to be flexible. Businesses change and that means that the information they use to monitor and operate themselves changes. An adaptable schema that can be extended to include new business processes, expansions, mergers and acquisitions will help the enterprise maintain its view of itself. That requires a system that is adaptable and can grow as the business grows. IT needs enough spare capacity to add new sources of data and enough processing power to update changes to the internal information schema without becoming overloaded.
Note: I have worked with analysts who have tried to hunt down data to create reports. Because the data was not well defined, and the analysts did not fully understand the source systems from which the data was drawn, the reports were fatally flawed. It led to bad decisions and cost tens of thousands of dollars. A single source of data that is well documented and understood by the analysts can prevent these types of errors.
Providing the data in a single place will help analysts make use of the information. With reliable information that is clearly defined, the analysts can reach across the organization to find important relationships within internal operations. The system can grow to include external data sources. This allows the analysts better opportunity to find the leading indicators and trends that will help make better decisions.
The canonical information schema for the entire enterprise will help rationalize and relate data from all the systems across the organization. Building on a platform that can scale to meet the growth of demand to house the schema will allow the business to grow and adapt to changes in its markets and internal operations.
Development of New Reports
I have found that when reports are required, the business states what they want. They seldom tell you where to find it. They do not care what you need to go through to gather the raw data or what it will take to process that information into a format that they can use. They just want the report.
An outgrowth of the previous point is that new reports can be developed much more quickly from a single source of data. Searching through the corporate source systems to locate the required data elements for a report is time-consuming and tedious. And if a central reporting repository is not established, it will be repeated for the next report.
That is one of the basic reasons to build a data warehouse. This becomes more important with the growing trend for self-service BI. Users have become much more sophisticated in their use of reporting technology. Report-building tools and spreadsheets have added significant capacity to connect to databases, query for data sets and visualize that data for the user. Letting the user who wants to use the information build his or her own reports can save time and money and enable users to publish information more quickly.
Whether the reports are developed by the development team or business users, the process is much faster from a central data source. Having the data defined, organized and processed makes producing the reports much faster. The ETL processes that cleanse and de-duplicate the data make it more complete and reliable. The additional analytical structures provide richer insights into the operations of the enterprise.
Pulling all this data together often requires a large system. The system needs to multitask, adding new data, building new analytic structures and responding to user queries.
Cost Considerations for Large Scale Data Warehouses
Most people know that in IT terms, bigger means more expensive. As systems get larger, the costs tend to increase dramatically. This may cause many organizations to resist the establishment of a large data warehouse system. Although some of these generalizations are true, there are some offsetting factors that need to be considered when determining the total cost of the data warehousing system.
Building on a Single Cohesive Platform
Many large organizations build several (often competing) data warehouses. They are frequently built on different platforms and use different software. This has a number of implications on the entire price of data warehousing.
I have worked with many organizations that keep smaller, distinct data marts rather than a large scale central data warehouse. These data marts often contain copies of the data found in another data mart. It is not uncommon for one data mart to use another data mart as a source for data. This leads to extra consumption of disk storage, increased use of network bandwidth to move duplicate data and additional ETL processing power to get the same data into multiple data stores. Although the cost of any one data mart may seem small, the cost of all of them can amount to a significant investment.
When companies have multiple data marts of data warehouses, they often host them on diverse platforms. The distinct hardware platforms each need distinct maintenance. Caring for more servers of different makes and models can be more costly. If the servers are mission-critical, the problem increases as the hardware is duplicated to provide redundancy. Different hardware often means different operating systems (OSs) and different database management systems. This causes lower quantities of licenses to be purchased and can mean higher unit costs.
Although a single, large scale data warehouse system may have a higher initial cost than a data mart-size operation, it often proves more economical in the long run. A single data warehouse system that can scale can be sized to meet the current needs and add capacity as the needs of the organization warrant it.
If you are maintaining multiple servers hosting multiple database systems, there are several operational costs that need to be considered:
- Training staff on multiple systems
- Monitoring multiple systems
- Troubleshooting on multiple platforms
- Patching and updating multiple systems
It is simpler and less costly to operate a single system with a unified OS and hardware platform. Monitoring is much easier and troubleshooting is greatly simplified.
This chapter discussed the rationale for building large scale data warehouses. In some cases, the size is driven simply from the size of the data set. Large volumes of data need to be organized and processed to harvest the value. In other cases, there are options. Using multiple data marts or smaller, segmented data warehouses can be an option. Many organizations are better served if the intelligence for the entire enterprise is gathered into one consistent, secure whole.
If the engineering challenge of building, operating and managing a large scale data warehouse can be met economically, it will often provide the best alternative for storing and analyzing the data collected from within and outside of the organization.
More about this book and others like it...
- Intrigued by this chapter excerpt? Download a free PDF of the entire chapter: Exploiting the power of large enterprise data warehouses
- To purchase the book or similar titles, visit the Realtime NEXUS website.