MapReduce(d) in the eyes of many Hadoop systems users

New technologies are augmenting -- and in some cases replacing -- the core components of Hadoop. Welcome to the new, not-so-settled reality of the big data era.

This article can also be found in the Premium Editorial Download: Business Information: IoT applications make advances, but hurdles lie ahead.

Poor MapReduce. Until late 2013, it was a critical cog in all Hadoop systems, serving as both the cluster resource manager and the primary programming and processing environment for the open source big data framework. But then things started to change.

The Apache Software Foundation's Hadoop 2 release added a new technology called YARN that usurped the resource management role and opened up Hadoop to applications other than MapReduce batch jobs. A still-growing gaggle of vendors rolled out SQL-on-Hadoop tools that let users write analytical queries against Hadoop data in standard SQL instead of MapReduce. And the Spark processing engine burst onto the scene, with proponents claiming it can run batch jobs up to 100 times faster than MapReduce, while supporting higher-level programming in popular languages such as Java and Python.
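To make the programming-model contrast concrete, here is a minimal sketch in plain Python, not Hadoop or Spark code. It counts words twice: once as explicit map, shuffle and reduce steps, the shape a MapReduce job takes, and once as a single higher-level expression of the kind newer engines encourage. All function names are hypothetical and chosen for illustration only; nothing here runs on a cluster or says anything about relative performance.

```python
from collections import Counter, defaultdict


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's list of values into a total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count_mapreduce(lines):
    """Word count written as three explicit MapReduce-style stages."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


def word_count_high_level(lines):
    """The same result as one declarative expression."""
    return dict(Counter(word for line in lines for word in line.lower().split()))


if __name__ == "__main__":
    sample = ["Hadoop runs MapReduce", "Spark runs on Hadoop"]
    print(word_count_mapreduce(sample))
    print(word_count_high_level(sample))
```

Both functions return the same counts. The difference is how much of the plumbing the programmer has to spell out, which is the gap that SQL-on-Hadoop tools and higher-level engine APIs set out to close.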

With all those forces arrayed against it, MapReduce has been, er, reduced in stature -- like an old steam engine being forced to give way to sleeker diesel locomotives. That sense was palpable at the Strata + Hadoop World 2015 conference in New York City, where various attendees talked about trying to get away from MapReduce -- in the words of one speaker, "as soon as possible and as much as possible."

The no-more MapReduce sentiment reached its apex at a session on MapReduce Geospatial, an open source toolkit for use in processing satellite images and other large sets of raster data. It turned out that the developers had just switched the technology, also known as MrGeo, from MapReduce to Spark. The result was faster performance and a 25% reduction in the code base, according to conference speaker Ryan Smith, an analytics manager at satellite imaging company DigitalGlobe. After the session, Smith acknowledged that it's probably time to come up with a new name for the toolkit.

And it isn't just MapReduce. The Hadoop Distributed File System (HDFS) -- the other core component of Hadoop's first incarnation -- also finds itself a bit on the run these days. At the Strata conference, Cloudera, the leading Hadoop distribution vendor, announced a columnar data store called Kudu as a potential alternative to HDFS for applications involving real-time analytics on streaming data. Hortonworks, another Hadoop vendor, introduced a separate piece of software for managing the flow of data between different systems, with no requirement that HDFS be part of the picture.

Neither MapReduce nor HDFS is going away anytime soon. There are too many applications built on top of them for that to happen, and plenty of Hadoop users remain committed to the pairing, at least for some of their big data processing needs. But it's entirely likely that new Hadoop systems will be deployed without either of the two technologies that once made up Hadoop.

Will they still really be Hadoop systems? That's kind of an existential question. But Hadoop's evolution -- or identity crisis, perhaps -- is indicative of the unsettled data management environment ushered in by the big data era. Gone are the verities of the relational database and SQL. Instead, we're living in a polyglot world with a variety of technology choices for different data processing and analytics needs. Relational software is still included, of course, but so are Hadoop, Spark, NoSQL databases and a vast and ever-expanding ecosystem of other big data tools. And Hadoop's position at the center of that ecosystem isn't guaranteed forever -- except, maybe, in name only.

This was last published in December 2015