Using big data and Hadoop 2: New version enables new applications
As with most 2.0 releases, Apache Hadoop 2 is a potentially key step forward for the open source distributed processing framework. The first version of Hadoop has found growing use, particularly for processing large amounts of unstructured data and acting as a staging area for incoming information. But it also imposed some significant limitations on users.
Hadoop 2, originally referred to as Hadoop 2.0, makes several major architectural advances. Most notably, it adds support for running non-batch applications built with programming models other than MapReduce. It also supports federation of Hadoop Distributed File System (HDFS) operations and configuration of redundant HDFS NameNodes, which together increase scalability and eliminate a nasty single point of failure that was part of the original design. In large part, Hadoop 2 is meant to widen the technology's utility for enterprise applications.
Prospective users looking to kick the proverbial tires on the new car that is Hadoop 2 likely have lots of questions about the Hadoop upgrade. Here are some answers for IT managers, data architects, developers and business executives involved in evaluating potential deployments of Hadoop clusters.
When can I get my hands on Hadoop 2?
The Apache Software Foundation released Hadoop 2 for general availability in October 2013, after a series of alpha releases that began in May 2012 and two beta releases in August and September 2013. In addition to the downloadable community version, commercial Hadoop distribution providers have since made the new software available to their customers. As with any open source software, though, bug reports and fixes are still part of the daily fare for Hadoop, so it's best to keep an eye out for issues.
What's the story with YARN?
It's worthwhile to keep in mind that "Hadoop, as it first came out, was a learning experience," said Dave Wells, an independent consultant at Seattle-based Infocentric and an instructor for The Data Warehousing Institute. "It was more about things patched together than it was about design and structure." With Hadoop 2, some of that patchiness begins to subside -- and a key contributor to that is a software layer known as YARN.
The most common knock on Hadoop 1.x, which coupled HDFS with the MapReduce parallel programming model, was that its batch-oriented format limited its use in interactive and iterative analytics and all but ruled out using the technology in real-time operations. Hadoop 2 changes that, principally through the addition of YARN.
Although its name is modest, YARN -- short for Yet Another Resource Negotiator -- casts a long shadow. It's a rebuilt cluster resource manager that ends Hadoop's total reliance on MapReduce and its batch processing format. YARN does that by separating the resource management and job scheduling capabilities previously handled by MapReduce from Hadoop's data processing layer. As a result, MapReduce becomes just one of many processing engines that can sit on top of YARN in Hadoop clusters.
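To make that separation concrete, here is a minimal sketch in Python. It is a toy model, not Hadoop's actual API: the class `ToyResourceManager`, its methods, the application names and the memory figures are all hypothetical, chosen only to illustrate the idea of one shared resource manager serving several processing engines, with MapReduce as just one client among them.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a cluster-wide resource manager (not the YARN API)."""

    total_memory_mb: int
    allocations: dict = field(default_factory=dict)  # application name -> MB granted

    def free_memory_mb(self) -> int:
        """Memory not yet granted to any application."""
        return self.total_memory_mb - sum(self.allocations.values())

    def request(self, app_name: str, memory_mb: int) -> bool:
        """Grant the request only if enough memory is still free."""
        if memory_mb <= 0 or memory_mb > self.free_memory_mb():
            return False
        self.allocations[app_name] = self.allocations.get(app_name, 0) + memory_mb
        return True

    def release(self, app_name: str) -> None:
        """Return everything an application holds to the shared pool."""
        self.allocations.pop(app_name, None)


# Different kinds of engines ask the same manager for resources;
# the manager does not care which programming model each one uses.
rm = ToyResourceManager(total_memory_mb=8192)
batch_ok = rm.request("mapreduce-batch-job", 4096)
stream_ok = rm.request("stream-processing-app", 2048)
too_big = rm.request("interactive-query-app", 4096)  # only 2048 MB remain

print(batch_ok, stream_ok, too_big, rm.free_memory_mb())
```

In this sketch the third request is refused until an earlier application releases its share, which is the kind of arbitration that Hadoop 1.x bundled into MapReduce itself.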
In effect, YARN opens the door for other programming frameworks and new types of applications, according to Douglas Moore, a principal consultant at Think Big Analytics in Mountain View, Calif. Until now, Moore said, "Hadoop has been like a freight train carrying freight." Hadoop 2, he added, will also be able to support programming approaches that let it "go around a racetrack very quickly, like a Lamborghini."
What's with all the talk about HDFS high-availability and federation in Hadoop 2?
As it was originally built, Hadoop had some big drawbacks as a parallel processing platform. Clusters were dependent on a single namespace server, called a NameNode; it maintained a directory tree of files in HDFS and kept track of where in a cluster data was stored so information could be found when needed. That created a single point of control in a cluster, which caused real trouble if the NameNode went down. It also limited users' ability to expand clusters and scale up their performance.
Those problems led to the development of the new high-availability and federation features for HDFS. Now pairs of redundant NameNodes can be configured to provide a backup in case the active one crashes or requires maintenance work. And independent NameNodes that share a pool of data storage can be added at will to, in Moore's words, "spread the processing out."
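The two ideas can be pictured with a small Python sketch. This is a toy model under stated assumptions, not HDFS code: the classes `ToyNameNodePair` and `ToyFederatedNamespace`, the host names and the path prefixes are all hypothetical. It shows an active/standby pair swapping roles on failover, and several independent pairs each serving its own part of the file namespace.

```python
from dataclasses import dataclass


@dataclass
class ToyNameNodePair:
    """Hypothetical active/standby pair of NameNodes."""

    active: str
    standby: str

    def fail_over(self) -> None:
        """Promote the standby, e.g. after a crash or for maintenance."""
        self.active, self.standby = self.standby, self.active


class ToyFederatedNamespace:
    """Hypothetical table mapping path prefixes to independent NameNode pairs."""

    def __init__(self) -> None:
        self.mounts = {}  # path prefix -> ToyNameNodePair

    def mount(self, prefix: str, pair: ToyNameNodePair) -> None:
        self.mounts[prefix] = pair

    def resolve(self, path: str) -> str:
        """Return the active NameNode for the longest matching prefix."""
        matches = [
            prefix
            for prefix in self.mounts
            if path == prefix or path.startswith(prefix.rstrip("/") + "/")
        ]
        if not matches:
            raise KeyError(f"no NameNode serves {path}")
        return self.mounts[max(matches, key=len)].active


namespace = ToyFederatedNamespace()
logs_pair = ToyNameNodePair(active="nn-logs-1", standby="nn-logs-2")
namespace.mount("/logs", logs_pair)
namespace.mount("/warehouse", ToyNameNodePair(active="nn-wh-1", standby="nn-wh-2"))

before = namespace.resolve("/logs/2014/events")
logs_pair.fail_over()  # simulate the active NameNode going down
after = namespace.resolve("/logs/2014/events")

print(before, after, namespace.resolve("/warehouse/sales"))
```

The failover in one pair leaves the other pair untouched, which reflects how federation spreads namespace work across NameNodes while high availability protects each one individually.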
The new capabilities were much-needed, said William Bain, CEO of in-memory data grid vendor ScaleOut Software Inc. "Single points of failure are not acceptable in any distributed environment," he added. The HDFS federation and high-availability features also set the stage for processing of bigger and bigger data pools, said Sanjay Sharma, a principal architect at software development services provider Impetus Technologies Inc. in Los Gatos, Calif. The federation scheme in particular is crucial for helping to grow Hadoop's data processing capacity "to the petabyte level," Sharma said.
Is Hadoop a mature, enterprise-ready technology now that Hadoop 2 is out?
Ending the reliance on MapReduce and introducing HDFS federation and high availability are big steps toward maturity for Hadoop. The technology also now runs on Windows and supports point-in-time data snapshots for backup and disaster recovery purposes. But it can still be a complicated platform to work with, in part because of its openness -- and its dependence on a diverse ecosystem of supporting tools to meet application needs. Some assembly typically is required in building out Hadoop-based environments. And Hadoop is at the center of ongoing changes in data architecture that seem to guarantee a bit of a "Wild West" feel for some time to come.
Hadoop 2's release does show how thinking about the framework has changed in recent years, according to Doug Cutting, who helped create Hadoop, worked on it at Yahoo and is now chief architect at Hadoop vendor Cloudera Inc.
"In 2009, when the 0.20 release was created, most folks thought of Hadoop as a useful tool in and of itself," Cutting said via email. "It primarily provided a MapReduce engine, making scalable, reliable batch computing available to enterprises." Now Hadoop can support a far wider range of workloads, he continued.
Nonetheless, even with Hadoop 2 now in the picture, Hadoop remains new territory, holding both promise and pitfalls for prospective users.