Hadoop data lake

This definition is part of our Essential Guide: Using big data platforms for data management, access and analytics
Contributor(s): Jack Vaughan

A Hadoop data lake is a data management platform comprising one or more Hadoop clusters used principally to process and store non-relational data such as log files , Internet clickstream records, sensor data, JSON objects, images and social media posts. Such systems can also hold transactional data pulled from relational databases, but they're designed to support analytics applications, not to handle transaction processing themselves.

While the data lake concept can be applied more broadly to include other types of systems, it most frequently involves storing data in the Hadoop Distributed File System (HDFS) across a set of clustered compute nodes based on commodity server hardware.

Because of the use of commodity hardware and Hadoop's standing as an open source technology, proponents claim that Hadoop data lakes provide a less expensive repository for analytics data than traditional data warehouses do. In addition, their ability to hold a diverse mix of structured, unstructured and semi-structured information can make them a more suitable platform for big data management and analytics applications than data warehouses based on relational software are.

However, a Hadoop data lake can be used to complement an enterprise data warehouse rather than to supplant it entirely. A Hadoop cluster can offload some data processing work from an EDW and host new analytics applications; filtered data sets or summarized results can then be sent to the data warehouse for further analysis by business users.

The contents of a Hadoop data lake need not be immediately incorporated into a formal database schema or consistent data structure, enabling users to store raw data as is; information can then either be analyzed in its raw form or prepared for specific analytics uses as needed. As a result, data lake systems tend to employ extract, load and transform (ELT) methods for collecting and integrating data, instead of the extract, transform and load (ETL) approaches typically used in data warehouses. Data can be extracted and processed outside of HDFS using MapReduce, Spark and other data processing frameworks.

Potential uses are varied. For example, Hadoop data lakes can pool varied legacy data sources, collect network data from multiple remote locations and serve as a way staion for data that is overloading another system.

The Hadoop data lake isn't without its critics, or challenges for users. Spark as well as the Hadoop framework itself can support file architectures other than HDFS. Meanwhile, data warehouse advocates contend that similar architectures -- for example, the data mart -- have a long lineage and that Hadoop and related open source technologies still need to mature significantly in order to match the functionality and reliability of data warehousing environments. Even experienced Hadoop data lake users say that a successful implementation requires a strong architecture and disciplined data governance policies -- without those things, they warn, data lake systems can become out-of-control dumping grounds.

This was last updated in May 2015

Continue Reading About Hadoop data lake



Find more PRO+ content and other member only offers, here.

Join the conversation

1 comment

Send me notifications when other members comment.

By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States. Privacy

Please create a username to comment.

Do you think the Hadoop data lake is a viable data architecture?


File Extensions and File Formats

Powered by: