Strata + Hadoop World 2016: Hadoop and Spark in spotlight
Poor MapReduce. Until late 2013, it was a critical cog in all Hadoop systems, serving as both the cluster resource manager and the primary programming and processing environment for the open source big data framework. But then things started to change.
The Apache Software Foundation's Hadoop 2 release added a new technology called YARN that usurped the resource management role and opened up Hadoop to applications other than MapReduce batch jobs. A still-growing gaggle of vendors rolled out SQL-on-Hadoop tools that let users write analytical queries against Hadoop data in standard SQL instead of MapReduce. And the Spark processing engine burst onto the scene, with proponents claiming it can run batch jobs up to 100 times faster than MapReduce, while supporting higher-level programming in popular languages such as Java and Python.
With all those forces arrayed against it, MapReduce has been, er, reduced in stature -- like an old steam engine being forced to give way to sleeker diesel locomotives. That sense was palpable at the Strata + Hadoop World 2015 conference in New York City, where various attendees talked about trying to get away from MapReduce -- in the words of one speaker, "as soon as possible and as much as possible."
The no-more MapReduce sentiment reached its apex at a session on MapReduce Geospatial, an open source toolkit for use in processing satellite images and other large sets of raster data. It turned out that the developers had just switched the technology, also known as MrGeo, from MapReduce to Spark. The result was faster performance and a 25% reduction in the code base, according to conference speaker Ryan Smith, an analytics manager at satellite imaging company DigitalGlobe. After the session, Smith acknowledged that it's probably time to come up with a new name for the toolkit.
And it isn't just MapReduce. The Hadoop Distributed File System (HDFS) -- the other core component of Hadoop's first incarnation -- also finds itself a bit on the run these days. At the Strata conference, Cloudera, the leading Hadoop distribution vendor, announced a columnar data store called Kudu as a potential alternative to HDFS for applications involving real-time analytics on streaming data. Hortonworks, another Hadoop vendor, introduced a separate piece of software for managing the flow of data between different systems, with no requirement that HDFS be part of the picture.
Neither MapReduce nor HDFS is going away anytime soon. There are too many applications built on top of them for that to happen, and plenty of Hadoop users do remain committed to the pairing, at least for some of their big data processing needs. But it's entirely likely that new Hadoop systems will be deployed without either of the two technologies that once made up Hadoop.
Will they still really be Hadoop systems? That's kind of an existential question. But Hadoop's evolution -- or identity crisis, perhaps -- is indicative of the unsettled data management environment ushered in by the big data era. Gone are the verities of the relational database and SQL. Instead, we're living in a polyglot world with a variety of technology choices for different data processing and analytics needs. Relational software is still included, of course, but so are Hadoop, Spark, NoSQL databases and a vast and ever-expanding ecosystem of other big data tools. And Hadoop's position at the center of that ecosystem isn't guaranteed forever -- except, maybe, in name only.