Hadoop and Spark are two of the most popular processing frameworks for big data architectures. Both provide a rich ecosystem of open source technologies for preparing, processing and managing large data sets and running analytics applications on them.
Most debates on using Hadoop vs. Spark revolve around optimizing big data environments for batch processing or real-time processing. But that oversimplifies the differences between the two frameworks, formally known as Apache Hadoop and Apache Spark. While Hadoop initially was limited to batch applications, it -- or at least some of its components -- can now also be used in interactive querying and real-time analytics workloads. Spark, meanwhile, was first developed to process batch jobs more quickly than was possible with Hadoop.
Also, it isn't necessarily an either-or choice. Many organizations run both platforms for different big data use cases, and Spark applications are often built on top of Hadoop's YARN resource management technology and the Hadoop Distributed File System (HDFS). HDFS is one of the main data storage options for Spark, which doesn't have its own file system or repository.
A key distinction is the Hadoop MapReduce processing engine and programming model. HDFS was tied to it in the first versions of Hadoop, while Spark was created specifically to replace MapReduce. Even though Hadoop no longer depends exclusively on MapReduce, there's still a strong association between them. "In the minds of many, Hadoop is synonymous with Hadoop MapReduce," said Erik Gfesser, principal architect at IT services and consulting firm SPR.
MapReduce has some advantages when it comes to keeping costs down for large processing jobs that can tolerate some delays. Spark has a clear advantage in delivering timely analytics insights because it's designed to process data mostly in memory. Here's a closer look at the components, features and capabilities of Hadoop and Spark and their key differences.
What is Hadoop?
First released in 2006, Hadoop was created by software engineers Doug Cutting and Mike Cafarella to process large amounts of data using its namesake file system and a novel distributed computing technique called MapReduce that Google promoted in a 2004 technical paper. Hadoop provides a way to efficiently break up large data processing problems across different computers, run computations locally and then combine the results. The architecture makes it easy to build big data applications for clusters containing hundreds or thousands of commodity servers, called nodes.
The main components of Hadoop include the following technologies:
- HDFS. Initially modeled on a file system developed by Google, HDFS manages the process of distributing, storing and accessing data across many separate servers. It can handle both structured and unstructured data, which makes it a suitable choice for building out a data lake.
- YARN. Short for Yet Another Resource Negotiator but typically referred to by its acronym, YARN is Hadoop's cluster resource manager, responsible for executing distributed workloads. It schedules jobs and allocates compute resources such as CPU and memory to applications. YARN took over those tasks from MapReduce when it was added as part of Hadoop 2.0 in 2013.
- MapReduce. While its role was reduced by YARN, MapReduce is still the built-in processing engine used to run large-scale batch applications in many Hadoop clusters. It orchestrates the process of splitting large computations into smaller ones that can be spread out across different cluster nodes and then runs the various processing jobs.
- Hadoop Common. This is a set of underlying utilities and libraries used by Hadoop's other components.
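To make the MapReduce model described above more concrete, here is a minimal single-process sketch in plain Python. It doesn't use Hadoop's APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are illustrative assumptions. Each inner list stands in for the data held locally on one node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the grouped values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits):
    # Each split is mapped independently, as it would be on separate nodes.
    mapped = chain.from_iterable(map_phase(split) for split in splits)
    return reduce_phase(shuffle(mapped))


# Hypothetical input: two splits, as if stored on two different nodes.
splits = [
    ["big data big clusters"],
    ["data nodes", "big jobs"],
]
counts = word_count(splits)
print(counts)
```

In a real cluster, the map calls would run in parallel next to the data, and the shuffle step would move intermediate pairs across the network before the reducers run.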
What is Spark?
Spark was initially developed by Matei Zaharia in 2009, while he was a graduate student at the University of California, Berkeley. His main innovation with the technology was to improve how data is organized to scale in-memory processing across distributed cluster nodes more efficiently. Like Hadoop, Spark can process vast amounts of data by splitting up workloads on different nodes, but it typically does so much faster. This enables it to handle use cases that Hadoop can't with MapReduce, making Spark more of a general-purpose processing engine.
The following technologies are among Spark's key components:
- Spark Core. This is the underlying execution engine that provides job scheduling and coordinates basic I/O operations, using Spark's basic API.
- Spark SQL. The Spark SQL module enables users to do optimized processing of structured data by directly running SQL queries or using Spark's Dataset API to access the SQL execution engine.
- Spark Streaming and Structured Streaming. These modules add stream processing capabilities. Spark Streaming takes data from different streaming sources, including HDFS, Kafka and Kinesis, and divides it into micro-batches to represent a continuous stream. Structured Streaming is a newer approach built on Spark SQL that's designed to reduce latency and simplify programming.
- MLlib. A built-in machine learning library, MLlib includes a set of machine learning algorithms, plus tools for feature selection and building machine learning pipelines.
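The micro-batch idea behind Spark Streaming can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Spark code: the `micro_batches` helper and the sample events are assumptions, and batches here are cut by record count for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size records from a stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical event stream; in practice this would be an unbounded source.
events = [3, 1, 4, 1, 5, 9, 2]

running_total = 0
totals = []
for batch in micro_batches(events, batch_size=3):
    # Each micro-batch is processed as a small batch job, and state
    # (the running total) is carried from one batch to the next.
    running_total += sum(batch)
    totals.append(running_total)

print(totals)
```

Treating a stream as a sequence of small batches lets the same batch-oriented engine handle both workloads, at the cost of latency that can't drop below the batch interval.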
The fundamental architectural difference between Hadoop and Spark relates to how data is organized for processing. In Hadoop, all the data is split into blocks that are replicated across the disk drives of the various servers in a cluster, with HDFS providing high levels of redundancy and fault tolerance. Hadoop applications can then be run as a single job or a directed acyclic graph (DAG) that contains multiple jobs.
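The block splitting and replication described above can be illustrated with a toy model. The sketch below is not HDFS code: the helper names, the tiny block size and the round-robin placement are simplifying assumptions. HDFS's actual placement policy is rack-aware and more involved.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """A file stays readable if every block has at least one live replica."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical 10-byte file, 4-byte blocks, four nodes, three replicas each.
nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), nodes, replication=3)

print(placement)
print(readable_after_failure(placement, {"node-a", "node-b"}))
```

With three replicas per block, the toy file survives the loss of any two nodes, which is the kind of fault tolerance the replication scheme is meant to provide.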
In Hadoop 1.0, a centralized JobTracker service allocated MapReduce tasks across nodes that could run independently of each other, and a local TaskTracker service managed job execution by individual nodes. Starting in Hadoop 2.0, though, JobTracker and TaskTracker were replaced with these components of YARN:
- A ResourceManager daemon that functions as a global job scheduler and resource arbitrator;
- NodeManager, an agent that's installed on each cluster node to monitor resource usage;
- ApplicationMaster, a daemon created for each application that negotiates required resources from ResourceManager and works with NodeManagers to execute processing tasks; and
- Resource containers that hold, in an abstract way, the system resources assigned to different nodes and applications.
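How these YARN pieces interact can be sketched with a toy allocator. Everything in the block below is a hypothetical simplification rather than YARN's API: `ToyResourceManager`, the first-fit policy and the node sizes are assumptions chosen for illustration. YARN's real schedulers also weigh queues, priorities and data locality.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Free capacity that a NodeManager would report for its node."""
    name: str
    free_memory_mb: int
    free_vcores: int


@dataclass(frozen=True)
class Container:
    """An abstract grant of memory and CPU on one node."""
    node: str
    memory_mb: int
    vcores: int


class ToyResourceManager:
    """Global arbitrator that hands out containers using first-fit."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, memory_mb, vcores):
        """Return a Container on the first node with room, else None."""
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                return Container(node.name, memory_mb, vcores)
        return None


rm = ToyResourceManager([Node("node-1", 4096, 2), Node("node-2", 8192, 4)])

# A hypothetical ApplicationMaster asks for three identical containers.
granted = [rm.allocate(memory_mb=3072, vcores=2) for _ in range(3)]
print([container.node for container in granted])
```

Once the cluster runs out of capacity, a further request returns `None` in this sketch. In a real cluster, the request would typically wait until resources are released.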
In Spark, data is accessed from external storage repositories, which could be HDFS, a cloud object store like Amazon Simple Storage Service or various databases and other data sources. While most processing is done in memory, the platform can also "spill" data to disk storage and process it there when data sets are too large to fit into the available memory. Spark can run on clusters managed by YARN, Mesos and Kubernetes or in a standalone mode.
Similar to Hadoop, Spark's architecture has changed significantly from its original design. In early versions, Spark Core organized data into a resilient distributed dataset (RDD), an in-memory data store that is distributed across the various nodes in a cluster. It also created DAGs to help in scheduling jobs for efficient processing.
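The RDD concept can be illustrated with a toy class that records transformations lazily and only runs them when a result is requested. This is a conceptual sketch in plain Python, not Spark's implementation: `ToyRDD` and its methods are hypothetical, and a linear list of steps stands in for the more general DAG that Spark builds.

```python
class ToyRDD:
    """Partitioned data plus the recorded steps needed to compute it."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions
        self._lineage = lineage

    def map(self, fn):
        # Transformations are lazy: they only extend the recorded lineage.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._partitions, self._lineage + (("filter", predicate),))

    def compute_partition(self, index):
        """Apply the recorded steps to one partition, e.g. to rebuild it."""
        records = iter(self._partitions[index])
        for kind, fn in self._lineage:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return list(records)

    def collect(self):
        """Action: run the recorded steps on every partition and gather results."""
        return [
            record
            for index in range(len(self._partitions))
            for record in self.compute_partition(index)
        ]


# Hypothetical data set split into two partitions.
rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])
pipeline = rdd.map(lambda x: x * 10).filter(lambda x: x > 20)

print(pipeline.collect())
print(pipeline.compute_partition(1))
```

Because each data set carries the steps that produced it, a partition lost to a node failure can be recomputed from its source data rather than restored from a replica, which is where the "resilient" in the name comes from.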
The RDD API is still supported. But starting with Spark 2.0, which was released in 2016, it was replaced as the recommended programming interface by the Dataset API. Like RDDs, Datasets are distributed collections of data with strong typing features, but they include richer optimizations through Spark SQL to help boost performance. The updated architecture also includes DataFrames, which are Datasets with named columns, making them similar in concept to relational database tables or data frames in R and Python applications. Structured Streaming and MLlib both utilize the Dataset/DataFrame approach.
Data processing capabilities
Hadoop and Spark are both distributed big data frameworks that can be used to process large volumes of data. Despite the expanded processing workloads enabled by YARN, Hadoop is still oriented mainly to MapReduce, which is well suited for long-running batch jobs that don't have strict service-level agreements.
Spark, on the other hand, can run batch workloads as an alternative to MapReduce and also provides higher-level APIs for several other processing use cases. In addition to the SQL, stream processing and machine learning modules, those include a GraphX API for graph processing and SparkR and PySpark interfaces for R and Python, respectively.
Hadoop processing with MapReduce tends to be slow and can be challenging to manage. Spark is often considerably faster for many kinds of batch processing: Proponents claim it can perform up to 100 times faster than an equivalent workload on Hadoop when processing batch jobs in memory.
A major reason for the speed difference is that Spark can do processing without having to write data back to disk storage as an interim step. But even Spark applications written to run on disk can run 10 times faster than comparable MapReduce workloads on Hadoop, according to Spark's developers.
But Hadoop may have an advantage when it comes to managing many longer-running workloads on the same cluster simultaneously. Running a lot of Spark applications at the same time can sometimes create memory issues that slow the performance of all the applications.
As a general principle, Hadoop systems can scale to accommodate larger data sets that are sporadically accessed because the data can be stored and processed more cost-effectively on disk drives than in memory. A YARN Federation feature added in Hadoop 3.0, which was released in 2017, enables clusters to support tens of thousands of nodes or more by connecting multiple "subclusters" that have their own resource managers.
The downside is that IT and big data teams may have to invest in more labor for on-premises implementations to provision new nodes and add them to a cluster. Also, with Hadoop, storage is colocated with compute resources on the cluster nodes, which can make it difficult for applications and users outside of the cluster to access the data. But some of these scalability issues can be automatically managed with Hadoop services in the cloud.
One of Spark's main advantages is that storage and compute are separated, which can make it easy for applications and users to access the data from anywhere. Spark includes tools that can help users dynamically scale nodes up and down depending on workload requirements; it's also easier to automatically reallocate nodes at the end of a processing cycle in Spark. A scaling challenge with Spark applications is ensuring that workloads are spread across nodes and kept independent of each other to reduce memory leakage.
Applications and use cases
Both Hadoop MapReduce and Spark are often used for batch processing jobs, such as extract, transform and load tasks to move data into a data lake or data warehouse. They both can also handle various big data analytics applications involving recent or historical data, such as customer analytics, predictive modeling, business forecasting, risk management and cyber threat intelligence.
Spark is often a better choice for data streaming and real-time analytics use cases, such as fraud detection, predictive maintenance, stock trading, recommendation engines, targeted advertising and airfare and hotel pricing. It's also typically a better fit for running quick analyses, graph computations and machine learning applications. In addition to including MLlib, Spark is now the recommended back-end platform for Apache Mahout, a machine learning and distributed linear algebra framework that initially was built on top of Hadoop MapReduce.
Deployment and processing costs
Organizations can deploy both the Hadoop and Spark frameworks using the free open source versions or commercial cloud services and on-premises offerings. However, the initial deployment costs are just one component of the overall cost of running the big data platforms. IT and data management teams must also account for the resources and expertise required to securely provision, maintain and update the underlying infrastructure and big data architecture.
One difference is that a Spark implementation typically will require more memory, which can increase costs when building out a cluster.
The broad Hadoop ecosystem also includes a variety of optional supporting technologies to install, configure and maintain, including widely used tools like the HBase database and Hive data warehouse software. Many of them can be used with Spark, too. Commercial versions of the frameworks bundle sets of these components together, which can simplify deployments and may help keep overall costs down.
Hadoop or Spark? It isn't always a rivalry
Hadoop and Spark aren't mutually exclusive. Sushant Rao, senior director of product marketing at big data platform vendor Cloudera, said that most businesses using Hadoop for data engineering, data preparation, machine learning and other applications are also using Spark as part of those workflows without any issues. In addition, both frameworks are commonly combined with other open source components for various tasks.
More than a half-dozen vendors initially created commercial Hadoop distributions, but the market has consolidated considerably. Cloudera remains an independent vendor -- it acquired Hortonworks, a rival Hadoop pioneer, in 2019 and now offers a combined Cloudera Data Platform technology bundle that was designed to be cloud-native. In addition, cloud platform market leaders AWS, Microsoft and Google all offer cloud-based big data platforms and managed services with Hadoop, Spark and other big data technologies -- Amazon EMR, Azure HDInsight and Google Cloud Dataproc, respectively.
In a sign of the diminishing focus on MapReduce, though, AWS and Google have de-emphasized Hadoop in their marketing materials and now highlight Spark and some of the other technologies from the Hadoop ecosystem. Databricks, a vendor founded by Spark creator Matei Zaharia and others involved in that framework's early development, also offers a cloud-based data processing and analytics platform built on Spark, now known as the Databricks Lakehouse Platform.