Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.

HDFS is a key part of the many Hadoop ecosystem technologies, as it provides a reliable means for managing pools of big data and supporting related big data analytics applications.

How HDFS works

HDFS supports the rapid transfer of data between compute nodes. At its outset, it was closely coupled with MapReduce, a programmatic framework for data processing.

When HDFS takes in data, it breaks the information down into separate blocks and distributes them to different nodes in a cluster, thus enabling highly efficient parallel processing.

Moreover, the Hadoop Distributed File System is specially designed to be highly fault-tolerant. The file system replicates, or copies, each piece of data multiple times and distributes the copies to individual nodes, placing at least one copy on a different server rack than the others. As a result, the data on nodes that crash can be found elsewhere within a cluster. This ensures that processing can continue while data is recovered.

HDFS uses master/slave architecture. In its initial incarnation, each Hadoop cluster consisted of a single NameNode that managed file system operations and supporting DataNodes that managed data storage on individual compute nodes. The HDFS elements combine to support applications with large data sets.

This master node "data chunking" architecture takes as its design guides elements from Google File System (GFS), a proprietary file system outlined in in Google technical papers, as well as IBM's General Parallel File System (GPFS), a format that boosts I/O by striping blocks of data over multiple disks, writing blocks in parallel. While HDFS is not Portable Operating System Interface model-compliant, it echoes POSIX design style in some aspects.

HDFS schema
HDFS architecture centers on commanding NameNodes that hold metadata and DataNodes that store information in blocks. Working at the heart of Hadoop, HDFS can replicate data at great scale.

Why use HDFS?

The Hadoop Distributed File System arose at Yahoo as a part of that company's ad serving and search engine requirements. Like other web-oriented companies, Yahoo found itself juggling a variety of applications that were accessed by a growing numbers of users, who were creating more and more data. Facebook, eBay, LinkedIn and Twitter are among the web companies that used HDFS to underpin big data analytics to address these same requirements.

But the file system found use beyond that. HDFS was used by The New York Times as part of large-scale image conversions, Media6Degrees for log processing and machine learning, LiveBet for log storage and odds analysis, Joost for session analysis and Fox Audience Network for log analysis and data mining. HDFS is also at the core of many open source data warehouse alternatives, sometimes called data lakes.

Because HDFS is typically deployed as part of very large-scale implementations, support for low-cost commodity hardware is a particularly useful feature. Such systems, running web search and related applications, for example, can range into the hundreds of petabytes and thousands of nodes. They must be especially resilient, as server failures are common at such scale.

HDFS and Hadoop history

In 2006, Hadoop's originators ceded their work on HDFS and MapReduce to the Apache Software Foundation project. The software was widely adopted in big data analytics projects in a range of industries. In 2012, HDFS and Hadoop became available in Version 1.0.

The basic HDFS standard has been continuously updated since its inception.

With Version 2.0 of Hadoop in 2013, a general-purpose YARN resource manager was added, and MapReduce and HDFS were effectively decoupled. Thereafter, diverse data processing frameworks and file systems were supported by Hadoop. While MapReduce was often replaced by Apache Spark, HDFS continued to be a prevalent file format for Hadoop.

After four alpha releases and one beta, Apache Hadoop 3.0.0 became generally available in December 2017, with HDFS enhancements supporting additional NameNodes, erasure coding facilities and greater data compression. At the same time, advances in HDFS tooling, such as LinkedIn's open source Dr. Elephant and Dynamometer performance testing tools, have expanded to enable development of ever larger HDFS implementations.

This was last updated in February 2018

Continue Reading About Hadoop Distributed File System (HDFS)

Dig Deeper on Hadoop framework