BACKGROUND IMAGE: iSTOCK/GETTY IMAGES
A data warehouse is a federated repository for all the data collected by an enterprise's various operational systems. Data warehousing emphasizes the capture of data from different sources for access and analysis.
Typically, a data warehouse is a relational database housed on an enterprise mainframe server or, increasingly, in the cloud. Data from various online transaction processing (OLTP) applications and other sources are selectively extracted for business intelligence activities, decision support and to answer user inquiries. Data warehouses are also used for online analytical processing (OLAP).
Basic components of a data warehouse
A data warehouse stores data that is extracted from data stores and external sources. The data records within the warehouse must contain details to make it searchable and useful to business users. Taken together, there are three main components of data warehousing:
- Data sources from operational systems, such as Excel, ERP, CRM or financial applications.
- A data staging area where data is cleaned and ordered.
- A presentation area where data is warehoused.
These can also be understood as tiers, where the bottom tier is the database server, the middle tier is the analytics engine and the top tier is data warehouse software which presents reporting and analysis.
Data analysis tools, such as business intelligence software, access the data within the warehouse. Data warehouses can also feed data marts, which are decentralized systems in which data from the warehouse is organized and made available to specific business groups, such as sales or inventory teams.
In addition, Hadoop has become an important extension of data warehouses for many enterprises because the data processing platform can improve components of the data warehouse architecture -- from data ingestion to analytics processing to data archiving.
Beyond basic data warehouses
Data warehouse benefits and options
Data warehouses can benefit organizations from both an IT and a business perspective. For example:
- Separating the analytical processes from the operational processes can enhance the operational systems and enable business users to access and query relevant data faster from multiple sources.
- Data warehouses can offer enhanced data quality and consistency, thereby improving business intelligence.
- Businesses can choose on-premises, the cloud or data-warehouse-as-a-service.
- On-premises data warehouses offer flexibility and security so IT teams can maintain control over their data warehouse management and configuration; from IBM, Oracle and Teradata as an example.
- Cloud-based data warehouses such as Amazon Redshift, Google BigQuery, Microsoft Azure SQL Data Warehouse and Snowflake enable companies to quickly scale while eliminating the initial infrastructure investments and ongoing maintenance requirements.
Types of data warehouses
There are three main approaches to implementing a data warehouse. Some organizations have also adopted hybrid options.
- Top-down approach: Inmon's method calls for building the data warehouse first. Data is extracted from operational and possibly third-party external systems and may be validated in a staging area before being integrated into a normalized data model. Data marts are created from the data stored in the data warehouse.
- Bottom-up method: Kimball's data warehousing architecture calls for dimensional data marts to be created first. Data is extracted from operational systems, moved to a staging area and modeled into a star schema design, with one or more fact tables connected to one or more dimensional tables. The data is then processed and loaded into data marts, each of which focuses on a specific business process. Data marts are integrated using a data warehouse bus architecture to form an enterprise data warehouse.
- Hybrid method: Hybrid approaches to data warehouse design include aspects from both the top-down and bottom-up methods. Organizations often seek to combine the speed of the bottom-up approach with the integration achieved in a top-down design.
Data warehouses vs. databases vs. data lakes
Databases and data lakes are often confused with data warehouses, but there are important differences. While data warehouses typically store data from multiple sources and utilize predefined schemas designed for data analytics, a database is generally used to capture and store data from a single source, such as a transactional system, and its schema is normalized. Databases are not designed to run across very large data sets.
By contrast, a data lake is a central repository for all types of raw data, whether structured or unstructured, from multiple sources. The schema is not defined, which allows for more types of analytics than data warehouses, which have defined schemas. Data lakes can be used for text searches, machine learning and real-time analytics.
Data warehouse innovations throughout history
The concept of data warehousing can be traced back to work conducted in the mid-1980s by IBM researchers Barry Devlin and Paul Murphy. The duo coined the term business data warehouse in their 1988 paper, "An architecture for a business and information system," which stated:
“The [business information system] architecture is based on the assumption that such a service runs against a repository of all required business information that is known as the Business Data Warehouse (BDW). ... A necessary prerequisite for the physical implementation of a business data warehouse service is a business process and information architecture that defines (1) the reporting flow between functions and (2) the data required.”
William H. Inmon furthered data warehouse development with his 1992 book Building the Data Warehouse, as well as by writing some of the first columns about the topic.
Inmon also created one of the most well-known methods for designing a data warehouse. His approach -- known as top-down design -- describes the technology as a subject-oriented, integrated, time-variant and nonvolatile collection of data that supports an organization's decision-making process.
The technology's growth continued with the founding of The Data Warehousing Institute, known as TDWI, in 1995, and with the 1996 publication of Ralph Kimball's book The Data Warehouse Toolkit. Kimball introduced the dimensional modeling approach to data warehouse design, a bottom-up approach in which the organization builds data marts first and then combines them into a single, all-encompassing data warehouse.
In 2008, Inmon introduced the concept of data warehouse 2.0, which focuses on the inclusion of unstructured data and corporate metadata.