Home > Data warehouse architectures, concepts and phases
Chapter Download:
EMAIL THIS

Data warehouse architectures, concepts and phases

10 Jun 2009 | by Matteo Golfarelli and Stefano Rizzi

Tips, expert advice and sample chapters
Digg This!    StumbleUpon Toolbar StumbleUpon    Bookmark with Delicious Del.icio.us    Add to Google

Data Warehouse Design book cover
This excerpt from Data Warehouse Design: Modern Principles and Methodologies, explains eight different types of data warehouse architecture including single-, two- and three-layer architecture, bus architecture, federated architecture and hub-and-spoke. Learn about ETL processes and data staging and how to get relevant data through extraction. Plus, read defenitions of data marts and legacy systems in this data warehouse architecture tutorial.

Table of contents:

An introduction to data warehousing
Data warehouse architectures, concepts and phases
The advantages of multidimensional databases and cube modeling
Accessing data warehouses best practices
Similarities and differences between ROLAP, MOLAP and HOLAP

1.3 Data Warehouse Architecture

The following architecture properties are essential for a data warehouse system (Kelly, 1997):

  • Separation Analytical and transactional processing should be kept apart as much as possible.
  • Scalability Hardware and software architectures should be easy to upgrade as the data volume, which has to be managed and processed, and the number of users' requirements, which have to be met, progressively increase.
  • Extensibility The architecture should be able to host new applications and technologies without redesigning the whole system.
  • Security Monitoring accesses is essential because of the strategic data stored in data warehouses.
  • Administerability Data warehouse management should not be overly difficult.
Two different classifications are commonly adopted for data warehouse architectures. The first classification, described in sections 1.3.1, 1.3.2, and 1.3.3, is a structure-oriented one that depends on the number of layers used by the architecture. The second classification, described in section 1.3.4, depends on how the different layers are employed to create enterprise-oriented or department-oriented views of data warehouses.


Copyright info
Excerpted from Data Warehouse Design: Modern Principles and Methodologies by Matteo Golfarelli and Stefano Rizzi (McGraw-Hill; 2009) with permission from McGraw-Hill.
1.3.1 Single-Layer Architecture
A single-layer architecture is not frequently used in practice. Its goal is to minimize the amount of data stored; to reach this goal, it removes data redundancies. Figure 1-2 shows the only layer physically available: the source layer. In this case, data warehouses are virtual. This means that a data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer (Devlin, 1997).

Figure 1-2: Single-layer architecture for a data warehouse system

Single-layer architecture for a data warehouse system

The weakness of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are submitted to operational data after the middleware interprets them. It this way, the queries affect regular transactional workloads. In addition, although this architecture can meet the requirement for integration and correctness of data, it cannot log more data than sources do. For these reasons, a virtual approach to data warehouses can be successful only if analysis needs are particularly restricted and the data volume to analyze is huge.


What is a legacy system?
Definition: The term legacy system denotes corporate applications, typically running on mainframes or minicomputers, that are currently used for operational tasks but do not meet modern architectural principles and current standards. For this reason, accessing legacy systems and integrating them with more recent applications is a complex task. All applications that use a nonrelational database are examples of legacy systems.
1.3.2 Two-Layer Architecture
The requirement for separation plays a fundamental role in defining the typical architecture for a data warehouse system, as shown in Figure 1-3. Although it is typically called a two-layer architecture to highlight a separation between physically available sources and data warehouses, it actually consists of four subsequent data flow stages (Lechtenbörger, 2001):

  1. Source layer A data warehouse system uses heterogeneous sources of data. That data is originally stored to corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.
  2. Data staging The data stored to sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one common schema. The so-called Extraction, Transformation, and Loading tools (ETL) can merge heterogeneous schemata, extract, transform, cleanse, validate, filter, and load source data into a data warehouse (Jarke et al., 2000). Technologically speaking, this stage deals with problems that are typical for distributed information systems, such as inconsistent data management and incompatible data structures (Zhuge et al., 1996). Section 1.4 deals with a few points that are relevant to data staging.
  3. Data warehouse layer Information is stored to one logically centralized single repository: a data warehouse. The data warehouse can be directly accessed, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories (section 1.6) store information on sources, access procedures, data staging, users, data mart schemata, and so on.
  4. Analysis In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. Technologically speaking, it should feature aggregate data navigators, complex query optimizers, and user-friendly GUIs. Section 1.7 deals with different types of decision-making support analyses.
The architectural difference between data warehouses and data marts needs to be studied closer. The component marked as a data warehouse in Figure 1-3 is also often called the primary data warehouse or corporate data warehouse. It acts as a centralized storage system for all the data being summed up. Data marts can be viewed as small, local data warehouses replicating (and summing up as much as possible) the part of a primary data warehouse required for a specific application domain.

What is a data mart?
Definition: A data mart is a subset or an aggregation of the data stored to a primary data warehouse. It includes a set of information pieces relevant to a specific business area, corporate department, or category of users.
Figure 1-3: Two-layer architecture for a data warehouse system
 Two-layer architecture for a data warehouse system

The data marts populated from a primary data warehouse are often called dependent. Although data marts are not strictly necessary, they are very useful for data warehouse systems in midsize to large enterprises because

  • they are used as building blocks while incrementally developing data warehouses;
  • they mark out the information required by a specific group of users to solve queries;
  • they can deliver better performance because they are smaller than primary data warehouses.

Sometimes, mainly for organization and policy purposes, you should use a different architecture in which sources are used to directly populate data marts. These data marts are called independent (see section 1.3.4). If there is no primary data warehouse, this streamlines the design process, but it leads to the risk of inconsistencies between data marts. To avoid these problems, you can create a primary data warehouse and still have independent data marts. In comparison with the standard two-layer architecture of Figure 1-3, the roles of data marts and data warehouses are actually inverted. In this case, the data warehouse is populated from its data marts, and it can be directly queried to make access patterns as easy as possible.

The following list sums up all the benefits of a two-layer architecture, in which a data warehouse separates sources from analysis applications (Jarke et al., 2000; Lechtenbörger, 2001):

  • In data warehouse systems, good quality information is always available, even when access to sources is denied temporarily for technical or organizational reasons.
  • Data warehouse analysis queries do not affect the management of transactions, the reliability of which is vital for enterprises to work properly at an operational level.
  • Data warehouses are logically structured according to the multidimensional model, while operational sources are generally based on relational or semi-structured models.
  • A mismatch in terms of time and granularity occurs between OLTP systems, which manage current data at a maximum level of detail, and OLAP systems, which manage historical and summarized data.
  • Data warehouses can use specific design solutions aimed at performance optimization of analysis and report applications.

Note: A few authors use the same terminology to define different concepts. In particular, those authors consider a data warehouse as a repository of integrated and consistent, yet operational, data, while they use a multidimensional representation of data only in data marts. According to our terminology, this "operational view" of data warehouses essentially corresponds to the reconciled data layer in three-layer architectures.


1.3.3 Three-Layer Architecture
In this architecture, the third layer is the reconciled data layer or operational data store. This layer materializes operational data obtained after integrating and cleansing source data. As a result, those data are integrated, consistent, correct, current, and detailed. Figure 1-4 shows a data warehouse that is not populated from its sources directly, but from reconciled data.

The main advantage of the reconciled data layer is that it creates a common reference data model for a whole enterprise. At the same time, it sharply separates the problems of source data extraction and integration from those of data warehouse population. Remarkably, in some cases, the reconciled layer is also directly used to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so as to benefit from cleaning and integration. However, reconciled data leads to more redundancy of operational source data. Note that we may assume that even two-layer architectures can have a reconciled layer that is not specifically materialized, but only virtual, because it is defined as a consistent integrated view of operational source data.

Figure 1-4: Three-layer architecture for a data warehouse system
Three-layer architecture for a data warehouse system

Finally, let's consider a supplementary architectural approach, which provides a comprehensive picture. This approach can be described as a hybrid solution between the single-layer architecture and the two/three-layer architecture. This approach assumes that although a data warehouse is available, it is unable to solve all the queries formulated. This means that users may be interested in directly accessing source data from aggregate data (drill-through). To reach this goal, some queries have to be rewritten on the basis of source data (or reconciled data if it is available). This type of architecture is implemented in a prototype by Cui and Widom, 2000, and it needs to be able to go dynamically back to the source data required for queries to be solved (lineage).

Note: Gupta, 1997a; Hull and Zhou, 1996; and Yang et al., 1997 discuss the implications of this approach from the viewpoint of performance optimization, and in particular view materialization.


1.3.4 An Additional Architecture Classification
The scientific literature often distinguishes five types of architecture for data warehouse systems, in which the same basic layers mentioned in the preceding paragraphs are combined in different ways (Rizzi, 2008).

In independent data marts architecture, different data marts are separately designed and built in a nonintegrated fashion (Figure 1-5). This architecture can be initially adopted in the absence of a strong sponsorship toward an enterprise-wide warehousing project, or when the organizational divisions that make up the company are loosely coupled. However, it tends to be soon replaced by other architectures that better achieve data integration and cross-reporting.

The bus architecture, recommended by Ralph Kimball, is apparently similar to the preceding architecture, with one important difference. A basic set of conformed dimensions (that is, analysis dimensions that preserve the same meaning throughout all the facts they belong to), derived by a careful analysis of the main enterprise processes, is adopted and shared as a common design guideline. This ensures logical integration of data marts and an enterprise-wide view of information.

In the hub-and-spoke architecture, one of the most used in medium to large contexts, there is much attention to scalability and extensibility, and to achieving an enterprise-wide view of information. Atomic, normalized data is stored in a reconciled layer that feeds a set of data marts containing summarized data in multidimensional form (Figure 1-6). Users mainly access the data marts, but they may occasionally query the reconciled layer.

Figure 1-5: Independent data marts architecture
Independent data marts architecture

Figure 1-6: Hub-and-spoke architecture
Hub-and-spoke architecture

The centralized architecture, recommended by Bill Inmon, can be seen as a particular implementation of the hub-and-spoke architecture, where the reconciled layer and the data marts are collapsed into a single physical repository.

The federated architecture is sometimes adopted in dynamic contexts where preexisting data warehouses/data marts are to be noninvasively integrated to provide a single, cross-organization decision support environment (for instance, in the case of mergers and acquisitions). Each data warehouse/data mart is either virtually or physically integrated with the others, leaning on a variety of advanced techniques such as distributed querying, ontologies, and meta-data interoperability (Figure 1-7).

Figure 1-7: Federated architecture
Federated architecture

The following list includes the factors that are particularly influential when it comes to choosing one of these architectures:

  • The amount of interdependent information exchanged between organizational units in an enterprise and the organizational role played by the data warehouse project sponsor may lead to the implementation of enterprise-wide architectures, such as bus architectures, or department-specific architectures, such as independent data marts.
  • An urgent need for a data warehouse project, restrictions on economic and human resources, as well as poor IT staff skills may suggest that a type of "quick" architecture, such as independent data marts, should be implemented.
  • The minor role played by a data warehouse project in enterprise strategies can make you prefer an architecture type based on independent data marts over a hub-and-spoke architecture type.
  • The frequent need for integrating preexisting data warehouses, possibly deployed on heterogeneous platforms, and the pressing demand for uniformly accessing their data can require a federated architecture type.


1.4 Data Staging and ETL

Now let's closely study some basic features of the different architecture layers. We will start with the data staging layer.

The data staging layer hosts the ETL processes that extract, integrate, and clean data from operational sources to feed the data warehouse layer. In a three-layer architecture, ETL processes actually feed the reconciled data layer—a single, detailed, comprehensive, top-quality data source—that in its turn feeds the data warehouse. For this reason, the ETL process operations as a whole are often defined as reconciliation. These are also the most complex and technically challenging among all the data warehouse process phases.

ETL takes place once when a data warehouse is populated for the first time, then it occurs every time the data warehouse is regularly updated. Figure 1-8 shows that ETL consists of four separate phases: extraction (or capture), cleansing (or cleaning or scrubbing), transformation, and loading. In the following sections, we offer brief descriptions of these phases.

Note: Refer to Jarke et al., 2000; Hoffer et al., 2005; Kimball and Caserta, 2004; and English, 1999 for more details on ETL.

The scientific literature shows that the boundaries between cleansing and transforming are often blurred from the terminological viewpoint. For this reason, a specific operation is not always clearly assigned to one of these phases. This is obviously a formal problem, but not a substantial one. We will adopt the approach used by Hoffer and others (2005) to make our explanations as clear as possible. Their approach states that cleansing is essentially aimed at rectifying data values, and transformation more specifically manages data formats.

Chapter 10 discusses all the details of the data-staging design phase. Chapter 3 deals with an early data warehouse design phase: integration. This phase is necessary if there are heterogeneous sources to define a schema for the reconciled data layer, and to specifically transform operational data in the data-staging phase.


1.4.1 Extraction
Relevant data is obtained from sources in the extraction phase. You can use static extraction when a data warehouse needs populating for the first time. Conceptually speaking, this looks like a snapshot of operational data. Incremental extraction, used to update data warehouses regularly, seizes the changes applied to source data since the latest extraction. Incremental extraction is often based on the log maintained by the operational DBMS. If a timestamp is associated with operational data to record exactly when the data is changed or added, it can be used to streamline the extraction process. Extraction can also be source-driven if you can rewrite operational applications to asynchronously notify of the changes being applied, or if your operational database can implement triggers associated with change transactions for relevant data.

The data to be extracted is mainly selected on the basis of its quality (English, 1999). In particular, this depends on how comprehensive and accurate the constraints implemented in sources are, how suitable the data formats are, and how clear the schemata are.

Figure 1-8: Extraction, transformation, and loading
 Extraction, transformation, and loading data


1.4.2 Cleansing
The cleansing phase is crucial in a data warehouse system because it is supposed to improve data quality—normally quite poor in sources (Galhardas et al., 2001). The following list includes the most frequent mistakes and inconsistencies that make data "dirty":

  • Duplicate data: For example, a patient is recorded many times in a hospital patient management system
  • Inconsistent values that are logically associated: Such as addresses and ZIP codes
  • Missing data: Such as a customer's job
  • Unexpected use of fields: For example, a socialSecurityNumber field could be used improperly to store office phone numbers
  • Impossible or wrong values: Such as 2/30/2009
  • Inconsistent values for a single entity because different practices were used: For example, to specify a country, you can use an international country abbreviation (I) or a full country name (Italy); similar problems arise with addresses (Hamlet Rd. and Hamlet Road)
  • Inconsistent values for one individual entity because of typing mistakes: Such as Hamet Road instead of Hamlet Road

In particular, note that the last two types of mistakes are very frequent when you are managing multiple sources and are entering data manually.

The main data cleansing features found in ETL tools are rectification and homogenization. They use specific dictionaries to rectify typing mistakes and to recognize synonyms, as well as rule-based cleansing to enforce domain-specific rules and define appropriate associations between values. See section 10.2 for more details on these points.


1.4.3 Transformation
Transformation is the core of the reconciliation phase. It converts data from its operational source format into a specific data warehouse format. If you implement a three-layer architecture, this phase outputs your reconciled data layer. Independently of the presence of a reconciled data layer, establishing a mapping between the source data layer and the data warehouse layer is generally made difficult by the presence of many different, heterogeneous sources. If this is the case, a complex integration phase is required when designing your data warehouse. See Chapter 3 for more details.

The following points must be rectified in this phase:

  • Loose texts may hide valuable information. For example, BigDeal LtD does not explicitly show that this is a Limited Partnership company.
  • Different formats can be used for individual data. For example, a date can be saved as a string or as three integers.

Following are the main transformation processes aimed at populating the reconciled data layer:

  • Conversion and normalization that operate on both storage formats and units of measure to make data uniform
  • Matching that associates equivalent fields in different sources
  • Selection that reduces the number of source fields and records

When populating a data warehouse, normalization is replaced by denormalization because data warehouse data are typically denormalized, and you need aggregation to sum up data properly.

Figure 1-9: Example of cleansing and transforming customer data
Cleansing and transforming customer data example

Cleansing and transformation processes are often closely connected in ETL tools. Figure 1-9 shows an example of cleansing and transformation of customer data: a field-based structure is extracted from a loose text, then a few values are standardized so as to remove abbreviations, and eventually those values that are logically associated can be rectified.


1.4.4 Loading
Loading into a data warehouse is the last step to take. Loading can be carried out in two ways:

  • Refresh Data warehouse data is completely rewritten. This means that older data is replaced. Refresh is normally used in combination with static extraction to initially populate a data warehouse.
  • Update Only those changes applied to source data are added to the data warehouse. Update is typically carried out without deleting or modifying preexisting data. This technique is used in combination with incremental extraction to update data warehouses regularly.

More on data warehouse design:



Digg This!    StumbleUpon Toolbar StumbleUpon    Bookmark with Delicious Del.icio.us    Add to Google



RELATED CONTENT
Data warehouse basics
Definition of primary, super, foreign and candidate key in the DBMS
What is the difference between a logical and physical warehouse design?
Best practices for data warehouse access and reports
Similarities and differences between ROLAP, MOLAP and HOLAP
An introduction to data warehousing and decision support systems
Advantages of the multidimensional database model and cube modeling
Data warehousing, data mining and data querying: Terms and definitions
What is an operational data store vs. a data warehouse?
Why you should build a data warehouse
Data mart vs. data warehouse

Enterprise data architecture best practices
Is in-database analytics an emerging business intelligence (BI) trend?
Understanding in-database analytics technology: Benefits, uses and ROI
Advantages and disadvantages of XML shredding
How to shred XML with the DB2 XMLTABLE function
Shredding XML docs into relational tables with annotated XML schemas
Teradata takes a logical approach to data warehousing appliances
Examples of single and bulk XML shredding of XML documents
What is the difference between a logical and physical warehouse design?
What are some emerging data warehouse and DBMS trends?
Teradata VP talks data warehouse appliances, reveals cloud and SSD plans

Database management systems (DBMS) architecture and design
Definition of primary, super, foreign and candidate key in the DBMS
What is the difference between a logical and physical warehouse design?
What are some emerging data warehouse and DBMS trends?
Data Warehouse Platforms Product Directory
Designing for performance: Strategic database application deployments
An introduction to database transaction management
Database access security: network authentication or data encryption?
Executing SQL statements using prepared statements and statement pooling
Static SQL vs. dynamic SQL for database application performance
How to get data/database independence with a three-tier architecture

RELATED GLOSSARY TERMS
Terms from Whatis.com − the technology online dictionary
data classification  (SearchDataManagement.com)
OLAP  (SearchDataManagement.com)

RELATED RESOURCES
2020software.com, trial software downloads for accounting software, ERP software, CRM software and business software systems
Search Bitpipe.com for the latest white papers and business webcasts
Whatis.com, the online computer dictionary




Data Compliance Articles and Research: Data Privacy, Financial Data Management, Healthcare Data
About Us  |  Contact Us  |  For Advertisers  |  For Business Partners  |  Site Index  |  RSS
SEARCH 
TechTarget provides technology professionals with the information they need to perform their jobs - from developing strategy, to making cost-effective purchase decisions and managing their organizations' technology projects - with its network of technology-specific websites, events and online magazines.

TechTarget Corporate Web Site  |  Media Kits  |  Site Map




All Rights Reserved, Copyright 2005 - 2009, TechTarget | Read our Privacy Policy
  TechTarget - The IT Media ROI Experts