This is the first part of a two-part excerpt from "Integration of Big Data and Data Warehousing," Chapter 10 of the book Data Warehousing in the Age of Big Data by Krish Krishnan, with permission from Morgan Kaufmann, an imprint of Elsevier. For more about data warehousing and big data integration, check out the second part of this book excerpt and get further insight from the author in this Q&A.
Data integration refers to combining data from different source systems for usage by business users to study different behaviors of the business and its customers. In the early days of data integration, the data was limited to transactional systems and their applications. The limited data set provided the basis for creating decision support platforms that were used as analytic guides for making business decisions.
The growth in data volume and data types over the last three decades, together with the advent of data warehousing and advances in the infrastructure and technologies that support data analysis and storage, has changed the landscape of data integration forever.
Traditional data integration techniques have focused on ETL, ELT, CDC and EAI types of architecture and their associated programming models. In the world of big data, however, these techniques will need to be modified to suit the demands of size and processing complexity, including the formats of data that need to be processed. Big data processing needs to be implemented as a two-step process. The first step is a data-driven architecture that includes the analysis and design of data processing. The second step is the physical architecture implementation, which is discussed in the following sections.
This excerpt is from the book Data Warehousing in the Age of Big Data by Krish Krishnan, published by Elsevier Inc, Waltham, MA. ISBN 978-0-12-405891-0. Copyright 2013, Elsevier Inc. For more information, please visit the Elsevier website.
In this technique of building the next-generation data warehouse, all the data within the enterprise is categorized according to data type. Depending on the nature of the data and its associated processing requirements, the data is then processed using business rules that are encapsulated in processing logic and integrated into a series of program flows incorporating enterprise metadata, MDM, and semantic technologies such as taxonomies.
Figure 10.3 shows the inbound data processing of different categories of data. This model segments each data type based on the format and structure of the data, and then applies the appropriate layers of processing rules within the ETL, ELT, CDC or text-processing techniques. Let us analyze the data integration architecture and its benefits.
As shown in Figure 10.3, there are broad classifications of data:
- Transactional data. The classic OLTP data belongs to this segment.
- Web application data. The data from Web applications that are developed by the organization can be added to this category. This data includes clickstream data, Web commerce data, and customer relationship and call center chat data.
- EDW data. This is the existing data from the data warehouse used by the organization currently. It can include all the different data warehouses and data marts in the organization where data is processed and stored for use by business users.
- Analytical data. This is data from analytical systems that are deployed currently in the organization. The data today is primarily based on EDW or transactional data.
- Unstructured data. Under this broad category, we can include:
  - Text: documents, notes, memos, contracts
  - Images: photos, diagrams, graphs
  - Videos: corporate and consumer videos associated with the organization
  - Social media: Facebook, Twitter, Instagram, LinkedIn, forums, YouTube, community websites
  - Audio: call center conversations, broadcasts
- Sensor data: includes data from sensors on any or all devices that are related to the organization's line of business. For example, smart meter data is a business asset for an energy company, and truck and automotive sensor data is relevant to logistics and shipping providers such as UPS and FedEx.
- Weather data: used by both B2B and B2C businesses today to analyze the impact of weather on the business; has become a vital component of predictive analytics.
- Scientific data: applies to the medical, pharmaceutical, insurance, healthcare and financial services segments, where a large amount of number-crunching computation is performed, including simulations and model generation.
- Stock market data: used for processing financial data in many organizations to predict market trends and financial risk and to support actuarial computations.
- Semi-structured data. This includes emails, presentations, mathematical models and graphs, and geospatial data.
With the different data types clearly identified and laid out, the data characteristics -- including the data type, the associated metadata, the key data elements that can be identified as master data elements, the complexities of the data, and the business users of the data from an ownership and stewardship perspective -- can be defined clearly.
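One way to picture this step is as a small catalog that records, for each data set, the characteristics listed above. The following is a minimal sketch, not the author's implementation: the `DataSetProfile` class, its field names, the `group_by_category` helper and the sample data sets are all hypothetical and chosen only for illustration. The category names come from the classification list.

```python
from dataclasses import dataclass, field
from enum import Enum


class DataCategory(Enum):
    # Categories taken from the classification list in the text.
    TRANSACTIONAL = "transactional"
    WEB_APPLICATION = "web application"
    EDW = "EDW"
    ANALYTICAL = "analytical"
    UNSTRUCTURED = "unstructured"
    SENSOR = "sensor"
    WEATHER = "weather"
    SCIENTIFIC = "scientific"
    STOCK_MARKET = "stock market"
    SEMI_STRUCTURED = "semi-structured"


@dataclass
class DataSetProfile:
    """Hypothetical catalog entry describing one data set."""
    name: str
    category: DataCategory
    metadata: dict = field(default_factory=dict)
    master_data_elements: list = field(default_factory=list)
    complexity: str = "unknown"  # free-text label; the scale is an assumption
    owner: str = ""              # business owner of the data
    steward: str = ""            # data steward


def group_by_category(profiles):
    """Group data set names by category so each group can get its own rules."""
    groups = {}
    for profile in profiles:
        groups.setdefault(profile.category, []).append(profile.name)
    return groups


# Illustrative data sets; names and attributes are made up.
profiles = [
    DataSetProfile("orders", DataCategory.TRANSACTIONAL,
                   master_data_elements=["customer_id", "product_id"],
                   owner="Sales", steward="Sales Ops"),
    DataSetProfile("clickstream", DataCategory.WEB_APPLICATION,
                   metadata={"format": "log"}),
    DataSetProfile("smart_meter_readings", DataCategory.SENSOR),
]
groups = group_by_category(profiles)
```

Keeping ownership, stewardship and master data elements on the same record as the category means that the processing rules chosen for a category can be traced back to the people and key elements they affect.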
The biggest need for processing big data is workload management, as discussed in earlier chapters.
The data architecture and classification allow us to assign the appropriate infrastructure that can execute the workload demands of the categories of the data.
Data can be assigned to one of four broad categories of workload, based on the volume of data and the associated latencies (Figure 10.4). Depending on the category, the data can then be assigned to physical infrastructure layers for processing. This approach to workload management creates a dynamic scalability requirement for all parts of the data warehouse, which can be designed by efficiently harnessing the current and new infrastructure options. The key point to remember at this juncture is that the processing logic needs to be flexible enough to be implemented across the different physical infrastructure components, since the same data might be classified into different workloads depending on the urgency of processing.
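The quadrant idea can be sketched as a function that maps a volume and a latency requirement to one of four combinations. This is an illustrative sketch only: the thresholds, the units (gigabytes and seconds) and the names `classify_workload` and `Level` are assumptions, not values taken from the book or from Figure 10.4.

```python
from enum import Enum


class Level(Enum):
    LOW = "low"
    HIGH = "high"


# Assumed cut-off points; a real design would derive these from
# the organization's own infrastructure and service levels.
VOLUME_THRESHOLD_GB = 1000
LATENCY_THRESHOLD_S = 60


def classify_workload(volume_gb, latency_s):
    """Return the (volume, latency) quadrant for a data set.

    latency_s is the acceptable processing latency in seconds:
    a small value means the data is urgent (low latency).
    """
    volume = Level.HIGH if volume_gb >= VOLUME_THRESHOLD_GB else Level.LOW
    latency = Level.HIGH if latency_s >= LATENCY_THRESHOLD_S else Level.LOW
    return (volume, latency)


# The same data set lands in different quadrants depending on urgency.
urgent = classify_workload(volume_gb=5000, latency_s=5)
relaxed = classify_workload(volume_gb=5000, latency_s=3600)
```

Here `urgent` and `relaxed` describe the same 5,000 GB of data but fall into different quadrants, which is why the processing logic cannot be tied to a single infrastructure layer.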
The workload architecture will further identify the conditions of mixed workload management where the data from one category of workload will be added to processing along with another category of workload.
For example, processing high-volume, low-latency data together with low-volume, high-latency data creates a diversified stress on the data-processing environment, which normally would have processed only one kind of data and its workload. Add to this complexity user queries and data loading happening at the same time or at relatively short intervals, and the situation can quickly get out of hand and impact overall performance. If the same infrastructure is processing big data and traditional data together with all of these complexities, the problem compounds itself.
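A mixed workload condition of this kind can be detected before it causes trouble. The sketch below assumes that every job on a shared environment is tagged with its workload category and a time window, and it flags pairs of jobs that overlap in time while belonging to different categories. The `Job` class, the category labels and the sample schedule are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    quadrant: str  # workload category label, e.g. "high-volume/low-latency"
    start: int     # start time in minutes since midnight (illustrative)
    end: int       # end time in minutes since midnight


def overlaps(a, b):
    """True if the two jobs' time windows intersect."""
    return a.start < b.end and b.start < a.end


def find_mixed_workloads(jobs):
    """Return name pairs of jobs that overlap in time and differ in category."""
    mixed = []
    for i, a in enumerate(jobs):
        for b in jobs[i + 1:]:
            if a.quadrant != b.quadrant and overlaps(a, b):
                mixed.append((a.name, b.name))
    return mixed


# Made-up schedule for one shared processing environment.
jobs = [
    Job("sensor_stream_load", "high-volume/low-latency", 0, 120),
    Job("monthly_ledger_load", "low-volume/high-latency", 60, 90),
    Job("nightly_archive", "high-volume/high-latency", 200, 300),
]
mixed = find_mixed_workloads(jobs)
```

In this schedule only the sensor stream and the ledger load run at the same time, so they are the only pair reported. User queries could be modeled as additional jobs in the same list.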
The goal of using the workload quadrant is to identify the complexities associated with the data processing and how to mitigate the associated risk in infrastructure design to create the next-generation data warehouse.