The vast amount of data organizations collect from various sources goes beyond what traditional relational databases can handle, creating the need for additional systems and tools to manage the data. This leads to the data warehouse vs. data lake question -- when to use which one and how each compares to data marts, operational data stores and relational databases.
All of these data repositories have a similar core function: housing data for business reporting and analysis. But their purpose, their structure, the types of data they store, where the data comes from and who has access to it differs.
In general, data comes into these repositories from systems that generate data -- CRM, ERP, HR, financial applications and other sources. The data records created from those systems are applied against business rules and then sent to a data warehouse, data lake or other data storage area.
Once all the data from the disparate business applications is collated onto one data platform, it can be used in data analytics tools to identify trends or deliver insights to help make business decisions.
This article is part of
Data warehouse vs. data lake
Organizations typically opt for a data warehouse vs. a data lake when they have a massive amount of data from operational systems that needs to be readily available for analysis. Data warehouses often serve as the single source of truth because these platforms store historical data that has been cleansed and categorized.
While data warehouses retain massive amounts of data from operational systems, a data lake stores data from more sources. A data lake platform is essentially a collection of various raw data assets that come from an organization's operational systems and other sources, often including both internal and external ones.
Because the data within data lakes may be uncurated and can originate from sources outside of the company's operational systems, it isn't a good fit for the average business analytics user; rather, data lakes are the playground of data scientists and other data analytics experts.
To remember how a data warehouse vs. data lake differ, picture actual warehouses and lakes: warehouses store curated goods from specific sources, whereas a lake is fed from rivers, streams and other unfiltered sources of water.
Data warehouse vendors include AWS, Cloudera, IBM, Google, Microsoft, Oracle, Teradata, SAP, SnapLogic and Snowflake, to name some of the many options. Data lakes are available from AWS, Google, Informatica, Microsoft, Teradata and other data management providers.
Data warehouse vs. data mart
Data marts are often confused with data warehouses, but the two serve markedly different purposes.
A data mart is typically a subset of a data warehouse; the data within it often comes from a data warehouse -- though it can come from another source. The data sent to a data mart is highly curated for a specific community of users -- such as a sales team -- to allow them to find the data they need quickly. The data is held there for specific uses, such as financial analytics.
Data marts are also much smaller than data warehouses -- they hold tens of gigabytes vs. the hundreds of gigabytes to petabytes of data that can be held in a data warehouse, ready for data processing.
Data marts can be built from an existing data warehouse or other data source system by designing and constructing the database table, populating it with relevant data and deciding who can access it.
Data warehouses vs. ODS
An operational data store (ODS) is a type of database that serves as the interim holding area for all the data that's about to enter the warehouse for data processing. Think of it as the warehouse loading dock, where goods are delivered, examined and verified. While in the ODS, data can be scrubbed, checked for redundancy and checked for compliance with business rules before entering the warehouse.
Queries can be made against data in the ODS, but the data there is transient, so it only supplies information for queries about, say, the status of a customer order that's in progress.
An ODS is typically run on a relational database management system (RDBMS) or on the Hadoop platform. Data is supplied to the ODS using data integration and data ingestion tools, such as Attunity Replicate or Hortonworks DataFlow.
Relational databases vs. data warehouses and data lakes
The main difference between a data warehouse vs. data lake vs. relational database system is that a relational database is used to store and organize structured data from a single source, such as a transactional system, while data warehouses are built to hold structured data from multiple sources. Data lakes differ from both in that they store unstructured, semistructured and structured data.
Relational databases are relatively simple to create and can be used to store and organize real-time data like transaction data. The downside of relational databases is that they don't support unstructured data or the vast amount of data being generated today. That brings us to the data warehouse vs. data lake decision. Still, many companies continue to rely on relational databases today for tasks such as operational data analysis or trend analysis.
Relational databases available on premises or in the cloud include Microsoft SQL Server, Oracle Database, MySQL and IBM Db2, as well as Amazon Relational Database Service, Google Cloud Spanner and others.
How an operational data store and data warehouse differ
Seven steps to a successful data lake implementation