This is the second half of a two-part excerpt from "Integration of Big Data and Data Warehousing," Chapter 10 of the book Data Warehousing in the Age of Big Data by Krish Krishnan, with permission from Morgan Kaufmann, an imprint of Elsevier.
This excerpt is from the book Data Warehousing in the Age of Big Data by Krish Krishnan, published by Elsevier Inc, Waltham, MA. ISBN 978-0-12-405891-0. Copyright 2013, Elsevier Inc. For more information, please visit the Elsevier website.
Identifying and classifying the analytical processing requirements for the entire set of data elements at play is critical to the design of the next-generation data warehouse platform. The requirement arises because analytics can be created at two points: at the data discovery level, where the work is narrowly focused, driven by the business consumer and not aligned with the enterprise version of the truth, and after data acquisition in the data warehouse.
Figure 10.5 shows the analytics processing in the next-generation data warehouse platform. The key architecture integration layer here is the data integration layer, a combination of semantic, reporting and analytical technologies. It is based on the semantic knowledge framework, which is the foundation of next-generation analytics and business intelligence. This framework is discussed later in this chapter.
Finalizing the data architecture is the most time-consuming task, but once completed it provides a strong foundation for the physical implementation. The physical implementation will be accomplished using the technologies from the earlier discussions, including big data and RDBMS systems.
Physical component integration and architecture
The next-generation data warehouse will be deployed on heterogeneous infrastructure and architectures that integrate both traditional structured data and big data into one scalable, high-performing environment. There are several options for deploying the physical architecture, each with pros and cons.
The primary challenges that will confront the physical architecture of the next-generation data warehouse platform include data loading, availability, data volume, storage performance, scalability, diverse and changing query demands against the data, and operational costs of maintaining the environment. The key challenges are outlined here and will be discussed with each architecture option.
- With no definitive format, metadata or schema, the loading process for big data is simply acquiring the data and storing it as files. This task can be overwhelming when you want to process real-time feeds into the system while also processing data in large-batch or microbatch windows. An appliance can be configured and tuned to address these rigors in the setup, as opposed to a pure-play implementation. The downside is that a custom architecture configuration may result, but this can be managed.
- Continuous processing of data in the platform can create contention for resources over a period of time. This is especially true in the case of large documents, videos or images. If this requirement is a key architecture driver, an appliance can be a suitable choice, as it removes the guesswork from the configuration and setup process.
- MapReduce configuration and optimization can be daunting in large environments, and the appliance architecture provides reference architecture setups that help avoid this pitfall.
- Data availability has been a challenge for any system that processes and transforms data for use by end users, and big data is no exception. The benefit of Hadoop or NoSQL is that it mitigates this risk and makes data available for analysis immediately upon acquisition. The challenge is to load the data quickly, as there is no pre-transformation required.
- Data availability depends on the specificity of metadata to the SerDe or Avro layers. If data can be adequately cataloged on acquisition, it can be available for analysis and discovery immediately.
- Since there is no update of data in the big data layers, reprocessing new data containing updates will create duplicate data, and this needs to be handled to minimize the impact on availability.
- Big data volumes can easily get out of control due to the intrinsic nature of the data. Care and attention need to be paid to the growth of data upon each cycle of acquisition.
- Retention requirements can vary depending on the nature of the data, its recency and its relevance to the business:
  - Compliance requirements: Safe Harbor, SOX, HIPAA, GLBA and PCI regulations can impact data security and storage. If you are planning to use these data types, plan accordingly.
  - Legal mandates: There are several transactional data sets that were not stored online and were required by courts of law for discovery purposes in class-action lawsuits. The big data infrastructure can be used as the storage engine for this data type, but the data mandates certain compliance needs and additional security. This data volume can impact overall performance, and if such data sets are being processed on the big data platform, the appliance configuration can provide the administrators with tools and tips to zone the infrastructure and mark the data in its own area, minimizing both risk and performance impact.
- Data exploration and mining is a very common activity that drives big data acquisition across organizations, and it also produces large data sets as the output of processing. These data sets need to be managed in the big data system by periodically sweeping and deleting intermediate data sets. This area is often ignored by organizations and can become a performance drain over time.
- Disk performance is an important consideration when building big data systems, and the appliance model can provide a better focus on the storage class and tiering architecture. This will provide the starting kit for longer-term planning and growth management of the storage infrastructure.
- If a combination of in-memory, SSD and traditional storage architecture is planned for big data processing, the persistence and exchange of data across the different layers can consume both processing time and cycles. Care needs to be taken in this area, and the appliance architecture provides a reference for such complex storage requirements.
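One challenge in the list above is that the big data layers do not update records in place, so reprocessing data that contains updates leaves duplicates behind. A minimal sketch of one way to handle this at read time is shown below, assuming each record carries a business key and a monotonically increasing version or timestamp. The field names `id` and `updated_at` and the function `latest_per_key` are hypothetical and chosen only for illustration; they do not come from the book.

```python
def latest_per_key(records, key="id", version="updated_at"):
    """Return one record per key, keeping the record with the highest version.

    Assumption: `records` is an iterable of dicts, and every dict has
    both the `key` field and a comparable `version` field.
    """
    latest = {}
    for rec in records:
        k = rec[key]
        # Keep the first record seen for a key, or replace it when a
        # strictly newer version of the same key arrives.
        if k not in latest or rec[version] > latest[k][version]:
            latest[k] = rec
    return list(latest.values())


# Illustrative data only: the second batch re-delivers order 1 with an update.
batch_1 = [
    {"id": 1, "updated_at": 1, "status": "new"},
    {"id": 2, "updated_at": 1, "status": "new"},
]
batch_2 = [
    {"id": 1, "updated_at": 2, "status": "shipped"},
]

current = latest_per_key(batch_1 + batch_2)
```

Resolving duplicates when the data is read keeps the load path append-only. On large volumes the same keep-the-latest rule would more likely run as a periodic compaction job, which is a design choice this sketch does not cover.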
Calculating the operational cost for a data warehouse and its big data platform is a complex task that includes initial acquisition costs for infrastructure, plus labor costs for implementing the architecture, plus infrastructure and labor costs for ongoing maintenance, including external help commissioned from consultants and experts.
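The cost components named above can be added up in a simple model. The sketch below is only an illustration of that arithmetic: the class `CostInputs`, its field names, the function `total_cost` and all of the figures are hypothetical, and external consulting is broken out as its own line item purely for visibility.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    """Hypothetical cost inputs, all in the same currency unit."""
    infrastructure_acquisition: int         # one-time
    implementation_labor: int               # one-time
    annual_infrastructure_maintenance: int  # recurring
    annual_maintenance_labor: int           # recurring
    annual_external_consulting: int         # recurring


def total_cost(costs: CostInputs, years: int) -> int:
    """One-time costs plus recurring costs over the given number of years."""
    one_time = costs.infrastructure_acquisition + costs.implementation_labor
    recurring = (
        costs.annual_infrastructure_maintenance
        + costs.annual_maintenance_labor
        + costs.annual_external_consulting
    )
    return one_time + recurring * years


# Placeholder figures, not taken from the book or from any real deployment.
example = CostInputs(
    infrastructure_acquisition=500_000,
    implementation_labor=200_000,
    annual_infrastructure_maintenance=50_000,
    annual_maintenance_labor=120_000,
    annual_external_consulting=30_000,
)

three_year_cost = total_cost(example, years=3)
```

A real estimate would also need to account for data growth per acquisition cycle and changes in storage tiering, both of which this flat model leaves out.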