As the Hadoop distributed processing framework has evolved, it has come to include far more than its original core, which consisted of the Hadoop Distributed File System (HDFS) and the MapReduce programming environment. Among a slew of new Hadoop ecosystem components, one technology has gained particular attention: the Spark in-memory data processing engine. Spark is replacing MapReduce for an increasing number of batch-processing jobs in Hadoop clusters -- proponents claim it can run them as much as 100 times faster than MapReduce.
After the Apache Spark open source software became available last year, Hadoop distribution vendors were quick to add the technology -- soon to be updated in a version 1.6 release -- to their product portfolios. But while Spark is now often found in big data applications, along with HDFS and Hadoop's YARN resource manager, it can also be used as a standalone service. That is sparking a growing debate in data management circles regarding Spark vs. Hadoop.
Will Hadoop continue to be a starting point for Spark? To gain a user view on that question, SearchDataManagement asked attendees at Strata + Hadoop World 2015 in New York whether they see the Spark processing engine as a complement to Hadoop, or an alternative to it and components such as YARN and MapReduce. Here's what some of them had to say on the Spark vs. Hadoop issue.
Sridhar Alla, big data architect at cable TV company Comcast: "Spark doesn't really store anything. Processing in Spark is replacing MapReduce and YARN, but the storage layer is going to be Hadoop for a long time."
Hakan Jonsson, data scientist for the Lifelog product team at Sony Mobile Communications: "It's a replacement. Spark is much faster than Hadoop is. And from a productivity standpoint, you don't have to do [analytical] modeling in a separate tool."
Brett Shriver, senior director of market regulation technology at the Financial Industry Regulatory Authority, or FINRA: "There are four or five performance-challenging [surveillance] patterns in our portfolio, and they're targeted for Spark. Long term, who knows? It may be the way we go. The jury is still out."
Joe Hsy, director of cloud services platforms and tools for Cisco's WebEx unit: "I think Spark is going to replace a large part of what we use MapReduce for now. And, over time, if Spark continues to expand its functionality, it could completely replace MapReduce."
William Theisinger, vice president of engineering at Yellow Pages producer YP LLC: "You need to get to where using technologies is predictable, and I wouldn't say that about Spark today. I'm still going to have to support MapReduce, too."
Charlie Crocker, business analytics program lead at software vendor Autodesk: "Whether you're using Hadoop or Spark, I think it's going to become a philosophical question. If you want to be revolutionary, you can say Hadoop is dead. But Hadoop isn't dead."
Hadoop has something of a head start on deployments, and despite MapReduce's reduced stature, many organizations already running MapReduce jobs will likely keep running them. Also, there has been a learning curve involved with getting Hadoop proof-of-concept applications into production, and Spark may well face a similar curve.
In a way, the ascent of Spark shows the ability of Hadoop to expand beyond its original components. And the onslaught of new big data technologies is likely to continue no matter how the issue of Spark vs. Hadoop plays out.
Executive editor Craig Stedman contributed to this story.