
Evolving Hadoop ecosystem presents new ways to program big data apps

Change is on tap as Hadoop 2.0 spawns new development options. At Hadoop Summit 2013, the potential for big data innovation was vivid.

The Hadoop ecosystem is a body in motion. Just a few years ago, you might have quickly but fairly described Hadoop as "HDFS, MapReduce and some glue" -- referring to the Hadoop Distributed File System, its associated software programming model and an emerging collection of APIs and utilities, which together were becoming synonymous with big data systems. What you knew then was true, but only for a spell.
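For readers who have not met the MapReduce programming model mentioned above, here is a minimal in-memory sketch of its three stages (map, shuffle, reduce) as a word count. It is plain Python, not the Hadoop API; the function names map_phase, shuffle, reduce_phase and word_count are hypothetical, and a real Hadoop job would spread the same steps across a cluster and read its input from HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big tent"]))
```

The point of the model is that the map and reduce functions hold no shared state, so a framework is free to run many copies of each in parallel on different slices of the data.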

If you journeyed last week to the San Jose Convention Center in California for Yahoo's and Hortonworks' Hadoop Summit 2013 to look for an introduction to the open source framework, you walked in at a time when things are changing. Hadoop 2.0 is getting closer, and with it come enhancements that could systematize an emerging style of data programming that goes under the banner of "Hadoop" but is more than just Hadoop itself.

While improvements are due for the Hadoop Distributed File System (HDFS) and Hadoop ecosystem components -- such as the HBase database, Hive data warehouse and Knox security gateway -- much of the attention now is being directed at Hadoop 2.0's YARN component. The acronym humbly stands for "Yet Another Resource Negotiator." The humility is deceptive because YARN allows you to swap out MapReduce if you choose, and promotes a type of interactivity different from the batch processing methods that brought Hadoop to prominence.

The Hadoop 2.0 HDFS implementation brings "some re-architecting of the system to remove some of the single points of failure," said Colin White, president and founder of consultancy BI Research in Ashland, Ore. That is a start -- but the real progress comes at the application programming interface (API) level. "What is quite a change is YARN," White said. "It enables you to use other file systems and things like that. It allows you to add flexibility to the environment, which is something enterprise users had been complaining about."

So your basic Hadoop definition goes by the boards. People are already talking about using Hadoop with IBM's General Parallel File System, with the Lustre file system for high-performance computing clusters and with other file systems as well. With YARN, MapReduce, too, becomes an option, not a defining element.

No longer is there a "core Hadoop," said Merv Adrian, an analyst at Gartner Inc. in Stamford, Conn., though he added that the Apache open source development group would challenge that notion.

YARN and definitions aside, one could say that pluggable options for data architectures are the order of the day, and Hadoop is the means. Or, in Adrian's words: "There is no level at which substitutes are not possible."

Everyone under the Hadoop tent

Hadoop and the tools surrounding it emerged over the years, as Adrian told me, when Web application developers -- often using JavaScript -- set about creating purpose-built data stores that they then put into the open source sphere.

"The Hadoop community is a center of gravity that is attracting innovative new uses," he said, while noting that Gartner itself recommends users rely on commercially available versions of Hadoop software and employ their freely downloadable open source counterparts only for sandboxing and the like.

What has happened in recent years, in Adrian's words, is "an explosion in data stores," many of them the NoSQL kind. They challenged the "SQL-only" data model of the day. Hadoop provides a big tent for the new movement.

The major reasons the various data stores came into being, Adrian said, were: First, the costs of the incumbent relational databases were too high for large-scale deployments; second, bureaucracy in the form of database schemas had too often become an encumbrance to invention; and third, relational data technology basically was not the right fit for Web applications.

Don't call me late for dinner

Now, data architects and developers interested in new varieties of data stores can find a home somewhere in the Hadoop ecosystem. At the Hadoop Summit, the enthusiasm about the new paradigm was palpable. We have seen this before -- with Java and AJAX (Asynchronous JavaScript and XML), for example. The Java language was just a stepping-off point for a whole new style of development, as was AJAX. And by the time AJAX had become a prominent style of development, it had, in the main, already jettisoned XML in favor of the simpler JavaScript Object Notation (JSON). The name "AJAX" was just a place marker; the style of programming was the thing.

What's important is that what the Hadoop community is doing these days represents a major shift for data management. The strangely named litany of open source tools and APIs that are camped out in the Hadoop tent lets developers work with data in innovative ways that the old data regime just didn't allow.
