A data warehouse is a federated repository for all the data collected by an enterprise's various operational systems, be they physical or logical. Data warehousing emphasizes the capture of data from diverse sources for access and analysis rather than for transaction processing.
Typically, a data warehouse is a relational database housed on an enterprise mainframe server or, increasingly, in the cloud. Data from various online transaction processing (OLTP) applications and other sources are selectively extracted for business intelligence activities, decision support and to answer user inquiries.
Basic components of a data warehouse
A data warehouse stores data that is extracted from data stores and external sources. The data records within the warehouse must contain details to make it searchable and useful to business users. Taken together, there are three main components of data warehousing:
- data sources from operational systems, such as Excel, ERP, CRM or financial applications;
- a data staging area where data is cleaned and ordered; and
- a presentation area where data is warehoused.
Data analysis tools, such as business intelligence software, access the data within the warehouse. Data warehouses can also feed data marts, which are decentralized systems in which data from the warehouse is organized and made available to specific business groups, such as sales or inventory teams.
In addition, Hadoop has become an important extension of data warehouses for many enterprises because the data processing platform can improve components of the data warehouse architecture -- from data ingestion to analytics processing to data archiving.
Data warehouse benefits and options
Data warehouses can benefit organizations from an both IT and a business perspective. Separating the analytical processes from the operational processes can enhance the operational systems and enable business users to access and query relevant data faster from multiple sources. In addition, data warehouses can offer enhanced data quality and consistency, thereby improving business intelligence.
Beyond basic data warehouses
Businesses can choose on-premises, the cloud or data-warehouse-as-a-service systems. On-premises data warehouses from IBM, Oracle and Teradata offer flexibility and security so IT teams can maintain control over their data warehouse management and configuration.
Cloud-based data warehouses such as Amazon Redshift, Google BigQuery, Microsoft Azure SQL Data Warehouse and Snowflake enable companies to quickly scale while eliminating the initial infrastructure investments and ongoing maintenance requirements.
Data warehouse innovations throughout history
The concept of data warehousing can be traced back to work conducted in the mid-1980s by IBM researchers Barry Devlin and Paul Murphy. The duo coined the term business data warehouse in their 1988 paper "An architecture for a business and information system," which stated:
The [business information system] architecture is based on the assumption that such a service runs against a repository of all required business information that is known as the Business Data Warehouse (BDW). ... A necessary prerequisite for the physical implementation of a business data warehouse service is a business process and information architecture that defines (1) the reporting flow between functions and (2) the data required.
William H. Inmon furthered data warehouse development with his 1992 book Building the Data Warehouse, as well as by writing some of the first columns about the topic.
Inmon also created one of the most well-known methods for designing a data warehouse. His approach -- known as top-down design -- describes the technology as a subject-oriented, integrated, time-variant and nonvolatile collection of data that supports an organization's decision-making process.
The technology's growth continued with the founding of The Data Warehousing Institute, known as TDWI, in 1995, and with the 1996 publication of Ralph Kimball's book The Data Warehouse Toolkit. Kimball introduced the dimensional modeling approach to data warehouse design, a bottom-up approach in which the organization builds data marts first and then combines them into a single, all-encompassing data warehouse.
In 2008, Inmon introduced the concept of data warehouse 2.0, which focuses on the inclusion of unstructured data and corporate metadata.
Data warehouse design methods
In addition to Inmon's top-down approach to data warehouses and Kimball's bottom-up method, some organizations have also adopted hybrid options.
- Top-down approach: Inmon's method calls for building the data warehouse first. Data is extracted from operational and possibly third-party external systems and may be validated in a staging area before being integrated into a normalized data model. Data marts are created from the data stored in the data warehouse.
- Bottom-up method: Kimball's data warehousing architecture calls for dimensional data marts to be created first. Data is extracted from operational systems, moved to a staging area and modeled into a star schema design, with one or more fact tables connected to one or more dimensional tables. The data is then processed and loaded into data marts, each of which focuses on a specific business process. Data marts are integrated using a data warehouse bus architecture to form an enterprise data warehouse.
- Hybrid method: Hybrid approaches to data warehouse design include aspects from both the top-down and bottom-up methods. Organizations often seek to combine the speed of the bottom-up approach with the integration achieved in a top-down design.
Data warehouses vs. databases vs. data lakes
Databases and data lakes are often confused with data warehouses, but there are important differences.
While data warehouses typically store data from multiple sources and utilize predefined schemas designed for data analytics, a database is generally used to capture and store data from a single source, such as a transactional system, and its schema is normalized. Databases aren't designed to run across very large data sets.
By contrast, a data lake is a central repository for all types of raw data, whether structured or unstructured, from multiple sources. The schema is not defined, which allows for more types of analytics than data warehouses, which have defined schemas. Data lakes can be used for text searches, machine learning and real-time analytics.