Using big data and Hadoop 2: New version enables new applications
The Apache Software Foundation recently released Hadoop 2, the newest version of its popular open source framework for highly scalable, distributed computing most commonly associated with big data. Hadoop 2 incorporates several new features, including YARN, a redesigned resource manager that Apache now describes as a large-scale, distributed operating system that allows multiple big data applications to run simultaneously.
But the release also shines a light on a major problem that companies mulling Hadoop initiatives are destined to face: The staggering lack of big data expertise in today's labor pool.
"There's a huge skills shortage," said Dr. Eric Little, vice president and chief scientist at Modus Operandi, a Melbourne, Fla., company that provides data management and analysis technology to government agencies, including the U.S. Navy and the U.S. Marine Corps.
Modus Operandi, which runs Hadoop among other big data technologies, is a small company and has found itself competing for the same talent as large software vendors and user organizations with more resources.
"Even the large businesses can't find these people," Little said. "What that means is that you're competing against IBM and Apple and Google and Amazon and Yahoo for the exact same pool of resources."
There are ways organizations can address the gaping skills gap. For one, they could invest in recruiting Hadoop specialists, though that's a luxury many companies can't afford.
Or they could cultivate big data skills on the inside. At Little's company, senior engineers spend plenty of time training junior ones so there's a steady stream of people who know how to handle big data analytics. At the same time, organizations can rely on Hadoop vendors like Cloudera, Hortonworks and MapR to do most of the heavy lifting.
According to market research company Gartner Inc. in Stamford, Conn., organizations will have only about a third of the people needed to run big data implementations by 2016.
"And realistically, we think those skills are going to go to the Global 1000, the big systems integrators, and they're going to go to software companies," said Merv Adrian, a Gartner analyst. "The rest of us are really going to be scrambling."
Many organizations have already begun looking internally for people to take part in Hadoop or other big data implementation plans -- and it isn't just IT workers playing a part. A small but growing number of non-IT business analysts are experimenting with big data technologies and beginning to build useful skill sets; in fact, they have often been the Hadoop pioneers.
"They are the unnoticed coders, if you will. They have been building stuff on their own. They've done it on Amazon, or they've downloaded free distributions and played with them," Adrian said. "They are already somewhat down the path, and in a lot of organizations, recruiting from within and identifying available skills is actually going to be an interesting opportunity."
Modus Operandi recommends that anyone interested in Hadoop take a training course, such as those offered by Cloudera and other vendors, to get the ball rolling.
Little said it's also a good idea to keep an open mind when recruiting and training new talent. For example, he said, a background in mathematics may be more desirable than a background in computer science.
"Part of it is that this involves a pretty strong math background," Little said. "My experiences are that the people who are often quite good at this have to be real algorithm developers, which means they have to be pretty strong applied mathematicians."
Organizations launching or experimenting with Hadoop 2.0 will want to get to know the ecosystem of open source projects that have grown up with Hadoop. Most of them have exotic names like Hive, Pig, Mahout, ZooKeeper, Flume and Sqoop.
"One project that is really important these days is Ambari, which is the management environment for Hadoop," Adrian said. "Nobody should consider doing anything with Hadoop without knowing Ambari."
Help for Hadoop
Few organizations are willing to launch a Hadoop initiative without the support of a vendor, but it is possible to go the independent route.
"Lots of the early adopters played with pure Apache and stayed with it," Adrian said. "But in general, the level of effort required for the ongoing care and feeding, updates, maintenance, integration testing, regression testing and backporting that goes on is something that mainstream execs don't want to have to do."
Organizations should work closely with one of the major Hadoop distribution providers, like Cloudera, MapR or Hortonworks. Other Hadoop vendors include Intel and EMC, which began offering and supporting distributions earlier this year.
Perhaps the biggest benefit of dealing with a Hadoop vendor is this: It makes sense of all the open source projects required to make big data projects hum. In addition to YARN, the core components of Hadoop 2 include Hadoop Distributed File System and MapReduce. The other open source projects involved have their own steering committees and proceed at their own rates. The distributors pre-integrate open source systems and they provide development environments and operational benchmarks for users to do tuning.
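For readers new to the MapReduce component mentioned above, the programming model can be sketched in a few lines of plain Python. This is a toy, single-process illustration of the map, shuffle and reduce steps applied to a word count; it does not use Hadoop or any of its APIs, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in order over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

In a real cluster, the map and reduce steps would run in parallel on many machines over data stored in HDFS; the sketch only shows the shape of the computation.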
What's new in Hadoop 2?
Regardless of how an organization addresses the big data skills gap, it's a good idea to gain an understanding of the new capabilities in Hadoop 2.0. This will give it a leg up if the time comes to negotiate with Hadoop distributors.
The addition of YARN in Hadoop 2 is important because the technology, which sits atop HDFS, makes it easier for users to create and share functionality between applications, said George Corugedo, co-founder and chief technology officer at RedPoint Global, a data management software company. YARN, short for the jocular "Yet Another Resource Negotiator" and also known as MapReduce 2.0, is replacing the previous version of MapReduce as the resource management tool of choice in Hadoop 2.
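The idea of a resource negotiator that lets several applications share one cluster can be illustrated with a small Python sketch. It is a deliberately simplified, first-come-first-served model, not YARN's actual scheduler or API; the class `ToyResourceManager`, its methods and the application names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a cluster-wide resource negotiator."""

    total_memory_mb: int
    # Maps an application name to the memory sizes of its granted containers.
    allocations: dict = field(default_factory=dict)

    @property
    def free_memory_mb(self):
        """Memory not currently granted to any application."""
        used = sum(sum(containers) for containers in self.allocations.values())
        return self.total_memory_mb - used

    def request_container(self, app_name, memory_mb):
        """Grant a container if enough memory is free; report success."""
        if memory_mb > self.free_memory_mb:
            return False
        self.allocations.setdefault(app_name, []).append(memory_mb)
        return True

    def release(self, app_name):
        """Return all of an application's containers to the shared pool."""
        self.allocations.pop(app_name, None)


if __name__ == "__main__":
    rm = ToyResourceManager(total_memory_mb=4096)
    # Two different kinds of applications draw on the same cluster.
    print(rm.request_container("mapreduce-job", 2048))
    print(rm.request_container("stream-app", 1024))
    print(rm.free_memory_mb)
```

The point of the sketch is only that resource management is separated from any single processing engine, so a batch job and another type of application can hold containers on the same cluster at the same time.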
Hadoop 2 also offers high availability and federation features for HDFS, support for Microsoft Windows, and the ability to take "snapshots" of data stored in HDFS, according to Apache.
According to Corugedo, organizations getting started with these technologies should remember to think big and start small. "With Hadoop as with any new technology, it is important to build credibility and skill and not fall prey to the hype and overreach," he said. "If a practitioner, instead, builds on their successes, they can be the hero that actually makes it all work."