Using big data and Hadoop 2: New version enables new applications
Even Hadoop's most enthusiastic proponents might admit that its marriage to MapReduce has limited what the open source technology can do. But with the advent of Hadoop 2 and its key component, the Hadoop YARN resource manager, the distributed processing framework has become a kind of launch pad for new applications incorporating a variety of related tools.
For example, Hadoop 2 is making real-time processing and analysis of streaming data possible for Synapse Wireless Inc., a Huntsville, Ala., maker of intelligent control and monitoring systems connected by a wireless mesh network. In present parlance, the company creates a "network of things" that uses the Internet to collect operational data from sensors and devices at customer sites. Some of the uses it supports are monitoring of healthcare operations and of large-scale commercial and residential lighting systems and solar panel fields.
"Our systems can capture high-velocity data streams coming off all these remote devices," said Bryan Stone, a cloud architect and lead platform developer at the company. With the pairing of Hadoop 2 and Storm, he added, "we don't just capture the data. We're also able to act on it. We can present it in a meaningful way so it can affect our customers' business decisions."
Using data integration tools from software vendor Pentaho Corp., Stone and his colleagues at Synapse Wireless have created a pilot healthcare monitoring application that puts Storm on top of YARN in a Hadoop 2 cluster. The application is intended to ensure good hand-washing hygiene in hospitals, as an example of what can happen when big data meets cloud computing and the Internet of Things.
As part of the application, tags on the badges that nurses wear can track their movements around a hospital. Other tags collect data on the use of hand-cleanser dispensers. When a nurse enters a patient's room, a timer starts on the use of the dispenser there. If the application doesn't register that the dispenser has been used, Stone said, "We can send an alert down to the badge that the nurse is wearing as a reminder that she needs to wash her hands."
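The alerting rule Stone describes can be sketched in a few lines of Python. This is only an illustration of the logic, not Synapse Wireless's implementation, which runs on Storm and YARN: the class name, method names, identifiers and the 30-second grace period are all hypothetical, and the sketch assumes that a dispenser event identifies only the room it happened in.

```python
from dataclasses import dataclass, field

# Hypothetical threshold; the article does not say how long the timer runs.
GRACE_SECONDS = 30.0


@dataclass
class HygieneMonitor:
    """Tracks room entries and flags badges whose timer expired unwashed."""

    grace_seconds: float = GRACE_SECONDS
    # Maps (badge_id, room_id) -> time in seconds when the nurse entered.
    pending: dict = field(default_factory=dict)

    def room_entered(self, badge_id, room_id, now):
        # A badge tag was seen entering a patient's room: start the timer.
        self.pending[(badge_id, room_id)] = now

    def dispenser_used(self, room_id):
        # Assumption: the dispenser tag reports only its room, so one use
        # clears every timer currently running in that room.
        for key in [k for k in self.pending if k[1] == room_id]:
            del self.pending[key]

    def overdue_alerts(self, now):
        # Return the badges to alert, and stop tracking them so that each
        # missed hand-wash produces a single reminder.
        overdue = [
            key
            for key, entered in self.pending.items()
            if now - entered >= self.grace_seconds
        ]
        for key in overdue:
            del self.pending[key]
        return [badge_id for badge_id, _room_id in overdue]
```

In a streaming deployment, the three methods would be driven by incoming events and a periodic clock tick rather than called directly; the badge IDs returned by `overdue_alerts` are the ones that would receive the reminder.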
Hadoop YARN gives batch jobs some company
While the original MapReduce-dependent version of Hadoop allowed Synapse Wireless to gather and analyze hand-washing data, the company couldn't act upon it immediately. Stone still sees value in MapReduce-based batch processing and analytics. But YARN "makes Hadoop more of [a platform] that you can build applications on top of," he said. "You can still use MapReduce in batch ways. But now you can roll out other applications, too."
Yahoo Inc., the company where Hadoop first took root, has been testing Hadoop 2 and YARN (which is short for Yet Another Resource Negotiator) since September 2012. Yahoo built a Storm-on-YARN application to enable faster processing of website user activity data after a MapReduce batch program became unable to handle the information fast enough to meet the company's analytics and reporting needs for serving up targeted ads to site visitors. It released the application as an open source technology last year.
Speaking at the Hadoop Summit 2013, Bruno Fernandez-Ruiz, a senior fellow and vice president of platforms at Yahoo, described YARN as a flexible cog in the Hadoop framework -- one that makes real-time processing in Hadoop clusters much more feasible than it was when they could only run MapReduce applications.
"The problem with MapReduce computing is the batch window," Fernandez-Ruiz said, adding that users such as Yahoo can't afford to queue up data for processing while waiting for a three-hour batch job to finish running.
YARN's capabilities have even led the Apache Software Foundation, which manages the development of Hadoop, and vendors such as Yahoo spin-off Hortonworks Inc. to label it as an "operating system." That might be an overstatement, industry analysts said. But they agreed that YARN provides an opportunity to broaden the use -- and benefits -- of Hadoop applications.
Calling YARN an operating system "is generous," said Gartner Inc. analyst Nick Heudecker. He compared it more to an application server, pointing to the Java middleware engines that began to gain ground in the late 1990s. And that's a good thing for users, according to Heudecker. "Developers can slide in different frameworks, some of which can be tightly integrated into the overall Hadoop stack," he said.
Making more work for Hadoop clusters
Philip Russom, data management research director at The Data Warehousing Institute, said Hadoop YARN's ability to concurrently execute and manage multiple processing jobs is something that "any decent operating system" should be expected to handle. "The concurrency feature alone makes Hadoop far more palatable to many [organizations]," he said, "because it enables multiple users with multiple application types to work simultaneously in the Hadoop environment."
Heudecker said YARN should also allow users to consolidate multiple Hadoop clusters, set up to process jobs simultaneously, into one large system. Instead of having the Hadoop equivalents of data marts, IT managers can combine systems and better rationalize technology, processing and management costs, he said.
James Dixon, Pentaho's founder and chief technology officer, said YARN "will reduce the amount of MapReduce code people write," something he views as a step forward for users. Dixon minces few words in describing the limitations of MapReduce, claiming that it meets only a narrow set of processing needs.
"There are very few problems for which MapReduce is the right solution," he said. What YARN provides that MapReduce doesn't, he added, is the ability to "pick the right programming framework for individual problems."