Storm-on-YARN, one in a constellation of rising Hadoop ecosystem stars, illustrates the growing divide between large-scale Web application shops and the typical enterprise.
Invented at Yahoo and recently released into open source, Storm-on-YARN couples the Storm event processor with Hadoop 2.0's YARN application management framework, promising to add low-latency processing to Hadoop applications formerly limited to batch jobs.
The combo has potential for large-scale applications. But at the moment, for many IT shops that potential is "nice" but not "necessary."
Even for the die-hard nerd, today's bugle charge of Hadoop-related projects must sometimes seem too much. For many enterprise shops, the problem is that each new Hadoop wrinkle could mean another set of specialized skills to obtain.
It's great that there is enthusiasm to solve problems and build innovative open source software for data handling -- but the Hadoop onslaught is such that some enterprise shops may just decide to wait out the deluge and get involved when times are "less interesting." After all, much about Hadoop requires Java know-how that doesn't run deep in most enterprise data warehouse teams.
Hadoop/NoSQL growth needs enterprise fuel
Yahoo and the other Web companies that drove early Hadoop development were replete with such skills. Much of the ballooning interest in Hadoop is fueled by the idea that the technology's use will expand from its roots in those companies and flower in more traditional businesses. But that could take years to happen.
While analysis and market research outfit Wikibon foresees considerable demand ahead for Hadoop and compatriot NoSQL database technologies, it cautions that market growth could be tempered by a lack of skills in the enterprise.
Wikibon recently projected 45% compound annual growth for the software and services portion of the Hadoop/NoSQL market over the next five years. That would vault this dynamic data duo from $540 million in worldwide revenue last year to $3.5 billion in 2017.
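As a rough check on that arithmetic, the short Python sketch below compounds the $540 million starting figure at 45% a year for five years, then works backward from the $3.5 billion target to the growth rate it implies. The function names are made up for illustration; only the dollar figures, the rate and the five-year span come from the projection cited above.

```python
# Illustrative check of the compound-growth arithmetic quoted above.
# Function names are hypothetical; the figures are the ones cited in the article.

def project(start, rate, years):
    """Value after compounding `start` at `rate` per year for `years` years."""
    return start * (1 + rate) ** years


def implied_cagr(start, end, years):
    """Annual growth rate needed to get from `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


start_musd = 540.0   # worldwide revenue in millions of dollars
target_musd = 3500.0  # projected 2017 revenue in millions of dollars
years = 5

projected = project(start_musd, 0.45, years)
rate = implied_cagr(start_musd, target_musd, years)

print(f"Projected after {years} years: ${projected / 1000:.2f} billion")
print(f"Growth rate implied by the target: {rate:.1%}")
```

Compounding at 45% lands at roughly $3.46 billion, and the $3.5 billion target implies a rate slightly above 45%, so the two quoted figures agree once rounding is taken into account.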
But report author Jeff Kelly, lead big data analyst at Wikibon, wrote that a lack of trained administrators and developers is making some companies reluctant to deploy commercial Hadoop and NoSQL technologies. In addition, he said, corporate IT chiefs don't always view the technologies as being fully enterprise-ready.
Yahoo has been at the forefront of Hadoop ever since the technology originated in 2005 -- long enough to uncover plenty of gotchas in developing and running Hadoop applications.
Hadoop pioneer takes some arrows
For example, Yahoo had problems getting a Hadoop system to process data on website user activities at a fast enough clip, according to Bruno Fernandez-Ruiz, a senior fellow and vice president of platforms at the company. He spoke about the performance issue in a keynote session at the Hadoop Summit 2013, a conference held in June by Yahoo and its Hortonworks Hadoop services spin-off.
Fernandez-Ruiz said the combination of Hadoop and MapReduce had issues keeping up with the incessant flow of information about e-mails, page views, searches and other user events that people at Yahoo wanted to quickly correlate with the company's online ad inventory.
The big culprit, he added, was MapReduce's batch orientation. Batch jobs created with the open source programming model might take two or three hours to process information; meanwhile, more data about new events continued to come in. For some applications, that's OK, Fernandez-Ruiz said. But, he noted, by the time a user-events batch job was completed, the site had garnered much more information. Yahoo collects data on "billions" of events a day, he said, and for purposes such as ad and events correlation, technologists at the company thought it would be worthwhile to process the data in a continuous stream.
So they came up with Storm-on-YARN. Long-running jobs are still handled as MapReduce batches, while Storm processes newly arriving events with low latency; those results can be added to the output of the MapReduce runs to give a more complete view of user activity.
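The general idea of pairing a slow batch result with a fast stream of late arrivals can be sketched in a few lines of Python. This is not Yahoo's code and it uses no Storm, YARN or MapReduce APIs; the function names, event tuples and per-user counts are all assumptions made up for illustration. It only shows how counts gathered while a batch job was running might be folded into that job's totals.

```python
from collections import Counter

# Hypothetical stand-ins for the two processing paths described above.
# No Hadoop or Storm API is used; events are plain (user, event_type) tuples.

def batch_view(events):
    """Stand-in for a long-running batch job: count events per user."""
    return Counter(user for user, _ in events)


def stream_update(view, event):
    """Stand-in for the low-latency path: fold one new event into a running count."""
    user, _ = event
    view[user] += 1
    return view


def combined_view(batch, recent):
    """Add the counts gathered since the batch run to the batch totals."""
    return batch + recent


# Events the batch job saw, and events that arrived while it was running.
historical = [("alice", "search"), ("bob", "page_view"), ("alice", "page_view")]
arrived_during_batch = [("alice", "email"), ("carol", "search")]

batch = batch_view(historical)

recent = Counter()
for event in arrived_during_batch:
    stream_update(recent, event)

merged = combined_view(batch, recent)
print(dict(merged))
```

In this toy version the merged view includes the user who appeared only after the batch job started, which is the gap a batch-only approach would leave open until its next run.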
Yahoo has more than 365 petabytes of Hadoop storage and, already, 30,000 cluster nodes managed by YARN (which stands for Yet Another Resource Negotiator). That is a lot more than you'll find in a typical organization today. Fernandez-Ruiz knows that but sees others taking a similar path eventually. "Our present is your future," he told the Hadoop Summit crowd.
Still, the methods of the Hadoop pioneers at Web wunderkinds like Google, Yahoo and eBay don't necessarily translate to success in the traditional enterprise. And how fast Hadoop applications take hold in mainstream IT shops will go a long way toward determining the overall speed of Hadoop's uptake.