The buzz around high-flying Hadoop dimmed during the dog days of this summer. Some disenchantment is to be expected with any new technology, but Hadoop's visit to the ''trough of disillusionment'' comes at a difficult time. Its long-time fellow traveler MapReduce is partly to blame.
Vendors and users alike have invested heavily in the parallel platform, but beyond financial and Web marketing applications, Hadoop is less visible. It often appears stalled at the proof-of-concept stage, with full production deployment in the enterprise still waiting in the wings.
The process has been seen before. As Hadoop (or any other world-beating technology) becomes more familiar, its strong points are taken for granted and its shortcomings get more publicity. In fact, it is Hadoop's original MapReduce component, rather than its file system and APIs, that is often the stumbling block for the platform. This is the case even though the 2.0 version of the Hadoop framework has opened up to MapReduce alternatives.
MapReduce as wobbly wheel
Invented at Google, batch-oriented MapReduce was a programming approach well suited to Web indexing and search. That is about where the consensus ends. Critics say it is difficult to program, ill suited to many analytical applications and never going to be real time. In a way, MapReduce is the wobbly wheel on the Hadoop juggernaut.
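For readers unfamiliar with the model, the pattern boils down to two user-written phases with a framework-managed shuffle in between. Here is a minimal, in-memory Python sketch of the idea using the canonical word-count example -- this illustrates the programming model only, not Hadoop's actual Java API or its distributed execution:

```python
from collections import defaultdict

# Map phase: each input record (a line of text) is turned into
# intermediate (key, value) pairs -- here, (word, 1).
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle: group intermediate values by key, as the framework
# would do between the map and reduce phases.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: combine the grouped values for each key.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 2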
MapReduce's shortcomings were less vivid when it was the only game in town. Hadoop's doldrums summer came in part at the hands of interesting new alternatives for data processing.
Two summer conferences, the Spark Summit and Google I/O, gave pause to what had become a loud clamor of Hadoop fanfare.
At the Spark Summit, attention focused on the Apache Spark data processing system. There, Hadoop naturally took a backseat.
The Spark software, built at the University of California, Berkeley, has been percolating for a few years, and more rapidly of late. Its non-batch operations and high-level programming interface appeal to people looking for MapReduce alternatives. The software can run on Hadoop, yes, but it can also run directly on systems such as Cassandra or Amazon S3.
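The appeal of that higher-level interface is easiest to see in code. The toy class below mimics the shape of Spark's chained RDD operations (flatMap, map, reduceByKey) in plain Python -- it is a stand-in for illustration, not PySpark itself, and the method names simply mirror Spark's:

```python
from collections import defaultdict

# A toy, in-memory stand-in for Spark's RDD, just to show the shape
# of the chained, high-level interface (not PySpark itself).
class ToyRDD:
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        return ToyRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return ToyRDD(f(item) for item in self.data)

    def reduceByKey(self, f):
        grouped = defaultdict(list)
        for key, value in self.data:
            grouped[key].append(value)
        merged = {}
        for key, values in grouped.items():
            acc = values[0]
            for value in values[1:]:
                acc = f(acc, value)
            merged[key] = acc
        return ToyRDD(merged.items())

# Word count as a single chain of transformations.
lines = ToyRDD(["the quick brown fox", "the lazy dog"])
counts = dict(
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
         .data
)
print(counts["the"])  # 2
```

Compared with writing separate map and reduce classes and wiring up a job, the whole computation reads as one expression -- which is much of what draws MapReduce-weary developers to Spark.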
There is a buzz about Spark. Data industry analyst Curt Monash sees a consensus rising that "the best next-generation parallelization paradigm is the one in Spark." It has special promise for streaming data, he said, and certainly that is an area that is challenging many next-generation big data applications that feature data-in-motion, or "fast data," operations, as well as machine learning operations.
Earlier this summer, at the Google I/O conference, MapReduce -- and traditional Hadoop, by extension -- took a black eye. Remarks there indicated that Google, the software's inventor, no longer uses MapReduce itself. At the same time, the company introduced a MapReduce competitor, the Google Cloud Dataflow application framework.
Tom Kershaw, product management lead for Google Cloud, said Dataflow moves data quickly and improves on MapReduce processing abilities that he finds wanting.
"MapReduce is oversimplified to the point of being terrifying," he said. MapReduce jobs run in chunks that "are hard to assemble," he said, while noting that Google Cloud Dataflow instead creates components that you can string together logically." This doesn't sound too far different from some of the conversation surrounding Hadoop 2.0's Yarn resource manager -- but, when, Google speaks, developers listen.
Trough time for Hadoop
Ironically, perhaps, at the same time Google rolls out a competitor to MapReduce-based Hadoop, it is making sure to support the Hadoop/MapReduce platform for customers of its cloud services. But the bloom is off that platform, it seems.
At this summer's TDWI BI Summit, cloud and integration consultant and author David Linthicum noted Google's sidestep of MapReduce. "A lot of people are rethinking Hadoop," he said. And that sort of sums up a whole school of thought this summer.
Is it fair? Maybe not. Hadoop has a new architecture that casts off its dependence on MapReduce. Underlying last year's hoopla around Hadoop 2.0 was an admission that MapReduce had many limitations.
That is what the introduction of YARN was all about. Still, Hadoop is old enough to have the problem of legacy -- also known as the ''installed base.'' You have to ask whether all the existing MapReduce apps will be rewritten as real-time apps overnight. The answer is probably not.
So Hadoop faces the classic problem of maintaining its installed base while expanding into real-time operations. As more Hadoop 2.0 applications are built and publicized, Hadoop will climb out of the trough, and be better for the respite -- but that will take some time.