Date: 2020-12-21

The vast amount of data organizations collect from various sources goes beyond what traditional relational databases can handle, creating the need for additional systems and tools to manage the data. This leads to the data warehouse vs. data lake question -- when to use which one and how they compare to each other.

All of these data repositories have a similar core function: housing data for business reporting and analysis. But they differ in their purpose, their structure, the types of data they store, where the data comes from and who has access to it.

In general, data comes into these repositories from systems that generate data -- CRM, ERP, HR, financial applications and other sources. The data records created by those systems are processed according to business rules and then sent to a data warehouse, data lake or other data storage area.

Once all the data from the disparate business applications is collated onto one data platform, it can be used in data analytics tools to identify trends or deliver insights to help make business decisions.

What is a data lake?

A data lake is a vast repository that stores raw data in its native format. One benefit of a data lake is that it can store data of varying structures. Each stored data element is tagged with a unique identifier and metadata so it can be queried more easily when needed. Data lakes have no predefined schema; analysts apply a schema after the ingestion process is complete. Data lakes are most commonly associated with the Hadoop framework, but many vendors now support data lake architectures as the influx of data continues to grow. Many vendors also support data lakes in the cloud.
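The idea of storing raw data first and applying a schema only at read time can be illustrated with a minimal Python sketch. Everything here is hypothetical and for illustration only: the in-memory `lake` dictionary stands in for real lake storage, and the `ingest` and `read_with_schema` helpers, the field names and the sample payloads are assumptions, not part of any vendor's product.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical in-memory "lake": maps a unique ID to a raw payload plus metadata.
lake = {}

def ingest(raw_bytes, source, fmt):
    """Store the payload untouched, tagged with a unique ID and metadata."""
    object_id = str(uuid.uuid4())
    lake[object_id] = {
        "raw": raw_bytes,
        "metadata": {
            "source": source,
            "format": fmt,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        },
    }
    return object_id

def read_with_schema(object_id, schema):
    """Apply a schema at read time: select and cast only the requested fields."""
    entry = lake[object_id]
    if entry["metadata"]["format"] != "json":
        raise ValueError("this sketch only parses JSON payloads")
    record = json.loads(entry["raw"])
    return {field: cast(record[field]) for field, cast in schema.items()}

# Data of different structures lands in the same repository as-is.
crm_id = ingest(
    b'{"customer": "Acme", "spend": "1250.50", "notes": "renewal due"}',
    source="crm",
    fmt="json",
)
log_id = ingest(b"2020-12-21 12:00:01 GET /index.html 200", source="web_logs", fmt="text")

# The analyst decides on a schema only when the data is needed.
spend_view = read_with_schema(crm_id, {"customer": str, "spend": float})
```

Because nothing is validated on the way in, the web log line is accepted even though no schema exists for it yet. That flexibility is the benefit described above, and it is also why the reader has to do the interpretive work later.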

What is a data warehouse?

A data warehouse is a repository for data collected and generated by business applications for a predetermined purpose. Data warehouses apply a predefined schema to data before storage, and data must be cleaned and organized before being stored in this repository. Because data stored in a data warehouse is already processed, it is easier to use for high-level analysis. BI tools can readily access and use the processed data, which makes data warehouses simpler for non-data professionals to work with.
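By contrast, the warehouse approach of defining the schema up front and cleaning data before it is stored might look like the following minimal sketch. It uses Python's built-in `sqlite3` module purely as a stand-in for a real warehouse engine. The `sales` table, the `clean` helper and the sample records are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")

# The schema is fixed before any data arrives.
conn.execute(
    """
    CREATE TABLE sales (
        customer TEXT NOT NULL,
        region   TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)

def clean(record):
    """Return a cleaned (customer, region, amount) row or raise on bad input."""
    customer = record["customer"].strip()
    region = record["region"].strip().upper()
    amount = float(record["amount"])  # raises ValueError if not numeric
    if not customer:
        raise ValueError("customer is required")
    return customer, region, amount

incoming = [
    {"customer": " Acme ", "region": "emea", "amount": "1250.50"},
    {"customer": "Globex", "region": "apac", "amount": "not a number"},
    {"customer": "Initech", "region": "emea", "amount": "300"},
]

# Records that fail cleaning or violate the schema never reach the table.
rejected = []
for record in incoming:
    try:
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", clean(record))
    except (KeyError, ValueError, sqlite3.IntegrityError):
        rejected.append(record)

# A BI-style query can rely on every stored row being well formed.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

The cost of this design is that the malformed record is turned away at load time. The payoff is that the final aggregate query needs no defensive parsing, which is what makes processed warehouse data approachable for BI tools and non-specialist users.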

Data warehouse vs. data lake

Organizations typically opt for a data warehouse over a data lake when they have a massive amount of data from operational systems that needs to be readily available for analysis. Data warehouses often serve as the single source of truth because these platforms store historical data that has been cleansed and categorized.

The key differences between a data warehouse and a data lake

While data warehouses retain massive amounts of data from operational systems, a data lake stores data from more sources. A data lake platform is essentially a collection of various raw data assets that come from an organization's operational systems and other sources, often including both internal and external ones.

Because the data within data lakes may be uncurated and can originate from sources outside of the company's operational systems, it isn't a good fit for the average business analytics user; rather, data lakes are the playground of data scientists and other data analytics experts.

To remember the difference between a data warehouse and a data lake, picture actual warehouses and lakes: Warehouses store curated goods from specific sources, whereas a lake is fed from rivers, streams and other unfiltered sources of water.

Data warehouse vendors include AWS, Cloudera, IBM, Google, Microsoft, Oracle, Teradata, SAP, SnapLogic and Snowflake, to name some of the many options. Data lakes are available from AWS, Google, Informatica, Microsoft, Teradata and other data management providers.