
This content is part of the Essential Guide: Trend watch: Data management and business intelligence technologies

MapReduce redux, or How Hadoop spent its summer vacation

Hadoop buzz dimmed down during summer's dog days. Its old pal MapReduce is somewhat to blame.

The buzz around high-flying Hadoop dimmed during the dog days of this summer. Some disenchantment is to be expected with any new technology, but Hadoop's visit to the "trough of disillusionment" comes at a difficult time. Its long-time fellow traveller MapReduce is somewhat to blame.

Vendors and users alike have invested a lot in the parallel platform, but when you go beyond financial and Web marketing applications, Hadoop seems less visible. It often appears stalled at the proof-of-concept stage. Full production Hadoop in the enterprise seems to be somewhere in the wings.

The process has been seen before. As Hadoop (or another world-beating technology) becomes more familiar, its strong points are taken for granted and its shortcomings get more publicity. In fact, it is Hadoop's original MapReduce component, rather than its file system and APIs, that is often the stumbling block for the platform. This is the case even though the 2.0 version of the Hadoop framework has opened up to MapReduce alternatives.

MapReduce as wobbly wheel

Invented at Google, batch-oriented MapReduce was a programming approach well suited to Web searching. That is about where consensus stops. It's difficult to program. Not suited for many analytical applications. It will never be real time. And so on. In a way, MapReduce is the wobbly wheel on the Hadoop juggernaut.
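For readers who have not worked with the model, the shape of a MapReduce job can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce stages on a word count, not Hadoop's actual API; the function names (map_phase, shuffle_phase, reduce_phase, run_job) are hypothetical, and a real job would spread this work across many machines and write intermediate results to disk.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(documents):
    """Run the whole batch: nothing is returned until every record has been processed."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["Hadoop runs MapReduce", "Spark runs on Hadoop"]))
```

Even in this toy form, every problem has to be forced into the map-then-reduce mold, and no answer appears until the whole batch has run, which is roughly what the complaints about programmability and real-time use come down to.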


MapReduce's shortcomings were less vivid when it was the only game in town. Hadoop's doldrums summer came in part at the hands of interesting new alternatives for data processing.

Two summer conferences, the Spark Summit and Google I/O, gave some pause from what had pretty much become a loud clamor of Hadoop fanfare.

At the Spark Summit, attention focused on the Apache Spark data processing system. There, Hadoop naturally took a backseat.

The Spark software, built at the University of California at Berkeley, has been percolating for a few years, but more rapidly of late. Its non-batch operations and high-level programming interface seem to interest people looking for MapReduce alternatives. The software can run on Hadoop, yes, but it can also run directly on systems such as Cassandra or Amazon S3.

There is a buzz about Spark. Data industry analyst Curt Monash sees a consensus rising that "the best next-generation parallelization paradigm is the one in Spark." It has special promise for streaming data, he said, and certainly that is an area that is challenging many next-generation big data applications that feature data-in-motion, or "fast data," operations, as well as machine learning operations.

Earlier this summer, at the Google I/O conference, MapReduce -- and traditional Hadoop, by extension -- took a black eye of some proportion. That was based on remarks that Google, the software's inventor, no longer really used MapReduce itself. At the same time, the company introduced a MapReduce competitor in the form of the Google Cloud Dataflow application framework.

Tom Kershaw, product management lead for Google Cloud, said Dataflow moves data quickly and improves on MapReduce processing abilities that he finds wanting.

"MapReduce is oversimplified to the point of being terrifying," he said. MapReduce jobs run in chunks that "are hard to assemble," he said, while noting that Google Cloud Dataflow instead creates components that you can string together logically. This doesn't sound too far different from some of the conversation surrounding Hadoop 2.0's Yarn resource manager -- but when Google speaks, developers listen.

Trough time for Hadoop

Ironically, perhaps, at the same time Google rolls out a competitor to MapReduce-based Hadoop, it is making sure to support the Hadoop/MapReduce platform for customers of its cloud services. But the bloom is off the Hadoop/MapReduce platform, it seems.

At this summer's TDWI BI Summit, cloud and integration consultant and author David Linthicum noted Google's sidestep of MapReduce. "A lot of people are rethinking Hadoop," he said. And that sort of sums up a whole school of thought this summer.

Is it fair? Maybe not. Hadoop has a new architecture that casts off its dependence on MapReduce. Underlying last year's hoopla around Hadoop 2.0 was an admission that MapReduce had many limitations.

That was what the introduction of Yarn was all about. Still, Hadoop is old enough to have the problem of legacy -- also known as "installed base." You have to ask if all the existing MapReduce apps will be rewritten as real-time apps overnight. The answer is probably not.

So Hadoop faces the classic problem of maintaining its installed base while expanding into real-time operations. As more Hadoop 2.0 applications are built and publicized, Hadoop will climb out of the trough, and be better for the respite, but this will involve some passage of time.


Join the conversation


Do you think MapReduce issues slow down Hadoop adoption?
The problem I see with hadoop is understanding where it is a fit and how it can be used. Plenty of organizations have unstructured data that could be searched. (Though in many cases splunk can do the trick just fine with no programming.)

I don't know how you can say mapreduce will never be real time; google is essentially real time, and it runs mapreduce!

Real time is a movable feast - as you say Google searches are BDQ. Yet, MapReduce is being complemented at Google these days by a variety of frameworks. I appreciate your point that there is much unstructured data, and hence interest in Hadoop - but figuring out what to do with all that data in Hadoop seems to be a stumbling block for the majority of people. Thanks for your comment, Matt.