Managing Hadoop projects: What you need to know to succeed
Hadoop initially was the province of the large Internet companies that created it, and the likes of eBay, Facebook, LinkedIn, Twitter and Yahoo remain marquee users of the open source distributed processing technology. A growing variety of organizations are also now looking to take advantage of Hadoop and related big data technologies -- for example, NASA, retailer Kohl's, agricultural and chemical products maker Monsanto and automobile pricing data provider Edmunds.com.
Overall, though, Hadoop cluster use still isn't widespread. In a survey of TechTarget readers on business intelligence, analytics and data warehousing implementations earlier this year, the percentage of active Hadoop and MapReduce users was still in the single digits, and nearly two-thirds of the 387 respondents who answered the question said their organizations had no current plans to deploy the two technologies. Even in companies with big data programs in place or planned, Hadoop ranked fourth on the list of technologies being used or eyed to help underpin the initiatives.
Because Hadoop is novel to most users, deploying it can present unfamiliar challenges to data architects and other members of project teams -- especially if they don't have experience with open source software or parallel processing on distributed clusters. Even seasoned IT hands may find surprises in working with Hadoop, because much assembly is typically required.
IT managers and corporate executives might look at how the Internet pacesetters are using Hadoop and "see a chance to do bigger systems at less cost," said Ofir Manor, a product manager and technical architect at Gene by Gene Ltd., a Houston-based genetic testing services company. But Manor, who also writes a blog on big data technologies, added that those expectations can be difficult to meet.
"It's relatively easy to do a small Hadoop implementation and try it out," he said. "Playing with the technology can be fun. But to move it to the infrastructure level is hard."
In addition to the technical challenges of deploying large-scale Hadoop systems and applications, another issue Manor cited is that IT operations often work in silos, with separate teams handling systems administration, database administration, storage, networking, security and application development. That approach can lead to problems in managing Hadoop clusters, he warned: "Hadoop requires more teamwork than usual, and enterprises may fall into a 'which team owns the platform?' debate."
Navigating the open source software culture can be a hurdle for some companies, too. The commercial distributions of Hadoop offered by a variety of IT vendors do help simplify the process of rolling out and supporting the software. But Manor said organizations have to ask themselves if they're ready and willing to commit their own developers to involvement in the Hadoop community, which can aid in efforts to take full advantage of the technology.
Many moving parts to pin down
Successfully implementing Hadoop platforms requires first coming to terms with the process of setting up the computer cluster that will run the software. And while Hadoop clusters are usually built around low-cost and easy-to-use servers, there are numerous configuration settings and issues to work through up front.
"Hadoop is a very complex environment. There are a lot of moving parts," said Douglas Moore, a principal consultant at Think Big Analytics, a consulting and development services provider in Mountain View, Calif., that focuses on big data deployments.
Moore said a Hadoop implementation team needs to make sure the size and overall design of its system are sufficient to handle the pipeline of data that will be fed into the cluster. Job scheduling routines and the performance of disk drives and other hardware components can also factor into the Hadoop-cluster performance equation.
For example, RAID Level 0 striping of data across a disk array, typically turned on by default in Hadoop systems, can shackle I/O speeds to the rate of the slowest drive in an array. In addition, a single disk failure can take down an entire array and temporarily knock all of a cluster node's data offline. As a result, various Hadoop vendors and consultants recommend configuring the disks in a cluster as separate devices or limiting RAID striping to pairs of disks.
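The striping bottleneck can be illustrated with a rough back-of-the-envelope model. The sketch below is a simplification, not a benchmark: it assumes a striped array delivers at most the slowest drive's rate multiplied by the number of drives, while disks configured as separate devices can each contribute their own rate. The function names and drive speeds are hypothetical.

```python
def striped_throughput(drive_mb_per_s):
    """Upper bound for a RAID 0 array: every stripe waits on the slowest drive."""
    return min(drive_mb_per_s) * len(drive_mb_per_s)


def independent_throughput(drive_mb_per_s):
    """Upper bound when each disk is a separate device serving its own I/O."""
    return sum(drive_mb_per_s)


# Hypothetical sequential read rates (MB/s) for four drives, one of them aging.
drives = [150, 150, 150, 90]

raid0 = striped_throughput(drives)
separate = independent_throughput(drives)
print(f"RAID 0 upper bound: {raid0} MB/s")
print(f"Separate disks upper bound: {separate} MB/s")
```

In this toy case, one slow drive holds the striped array to 360 MB/s, against 540 MB/s for the same four disks configured separately.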
Also, because Hadoop is so often combined with supporting software such as HBase and Hive, pinpointing the sources of performance problems can be, well, problematic. In working with clients to optimize cluster performance, Moore and his fellow consultants find that in many cases the first suspect isn't necessarily the culprit. "We've been brought in for technology assessments by people who think they had an issue with HBase failing," he said. "But the fact is, the problem could be with how their workflow is set up -- how they're rolling jobs into a cluster."
More needed than more nodes
The use of commodity servers makes it relatively inexpensive to add more nodes to a cluster. And with the fast-paced growth of Google, Twitter and other Web powerhouses, and the corresponding expansion of their data processing requirements, scaling out clusters as needed to boost performance became a common strategy. But that approach isn't likely to fly in more traditional organizations, said Vin Sharma, director of product marketing for Hadoop at Intel Corp.
"It's true that 'throw another node at it' may have become a mantra at fast-growing 'Web monsters,' but it won't be repeated in the typical enterprise," Sharma said. Instead, he expects to see more of a focus on troubleshooting performance problems. Doing so in a Hadoop cluster, though, "is more complicated than in the average system," he said. "It requires expertise that not every organization has in-house."
The first order of business once a cluster is set up, according to Sharma, is to deploy performance monitoring tools to help identify bottlenecks. He also recommends checking MapReduce applications to ensure that they've been designed for optimal performance on a cluster. "If [an application] requires a lot of network communication, it may not be a good fit," he said.
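One rough way to act on that advice is to compare how much data a job moves across the network between its map and reduce phases with how much it reads as input. The sketch below assumes those byte counts have already been collected from a job's counters. The function names, the sample figures and the 1.0 threshold are illustrative assumptions, not Hadoop defaults.

```python
def shuffle_ratio(input_bytes, shuffle_bytes):
    """Return bytes shuffled between map and reduce per byte of input read."""
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    return shuffle_bytes / input_bytes


def is_network_heavy(input_bytes, shuffle_bytes, threshold=1.0):
    """Flag a job whose shuffle volume meets or exceeds the chosen threshold."""
    return shuffle_ratio(input_bytes, shuffle_bytes) >= threshold


# Hypothetical jobs: (name, input bytes, shuffle bytes).
jobs = [
    ("daily_aggregate", 500 * 10**9, 20 * 10**9),
    ("self_join", 200 * 10**9, 450 * 10**9),
]

for name, read, shuffled in jobs:
    label = "network-heavy" if is_network_heavy(read, shuffled) else "ok"
    print(f"{name}: ratio={shuffle_ratio(read, shuffled):.2f} -> {label}")
```

A job that shuffles more data than it reads, like the hypothetical self-join here, is the kind of candidate worth reviewing before adding hardware.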
Hadoop itself might not be the right choice to begin with: The high-fevered interest in the technology shouldn't obscure the fact that it isn't the best option for every application, cautioned Tony Cosentino, an analyst at Ventana Research in San Mateo, Calif. "Don't think about technology first," he said. "Think first about the business problem you're trying to solve, because you may not even need a Hadoop cluster."
And while it's tempting to follow the lead of the Internet giants down the Hadoop path, doing so shouldn't be automatic, Manor said, noting that the needs of those companies and other types of businesses are often different. "The tools to solve the [online] scalability issue are not always a good fit for enterprise challenges," he said.