Managing Hadoop projects: What you need to know to succeed
Hadoop often enters an organization on the backs of multiple data processing jobs created by various users. But those jobs can conflict with one another and contend for the available processing resources once they're all running on a Hadoop system.
The problem of Hadoop cluster management recently confronted David Clubb, senior data engineer at mobile-gaming platform and marketplace company Chartboost, as he sought to apply production best practices to the big data technology. "We were updating our [Hadoop] software, and migrating to a new cluster, but there wasn't a good way to look into the cluster," Clubb said. "You could see what jobs were running, but you really couldn't see how the resources were being used."
That was a problem because lower-priority extract, transform and load processes and analytics jobs could steal CPU cycles from higher-priority jobs, such as reports that were needed immediately. At the same time, some compute nodes in the cluster might be left relatively underutilized, according to Clubb. The issues led him to implement new software that provides both a better window into the activities of cluster resources and tools for real-time optimization of processing workloads, which include a mix of MapReduce, Hive and Spark applications.
Spinning in place with YARN
Installed this year, the Hadoop system handles more than 1 billion events per day, creating a large pool of data for use by business product line managers looking to understand how the Chartboost platform is being used both by gamers and game developers. At the outset, Chartboost, which runs Cloudera's Hadoop distribution on the Amazon Web Services cloud, relied solely on the open source framework's built-in Hadoop YARN scheduler to set up its workloads. But using YARN didn't let the San Francisco company give any special treatment to the high-priority jobs, Clubb said.
He added that Hadoop creates some information for cluster management uses, but because of its open source architecture, the data was split up into different places. Homemade scripts solved some of the management issues but also fell short of the desired mark.
Along the way, Clubb found software from startup Pepperdata Inc. that gave him a deeper view into I/O, memory and CPU use across the Hadoop cluster. Even more important for Chartboost, he said, the Pepperdata software can automatically slow down lesser jobs, giving additional running space to more crucial applications, and help ensure that all of the cluster's compute nodes are used as effectively as possible.
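To make the idea of slowing lesser jobs in favor of crucial ones concrete, here is a minimal sketch of priority-ordered CPU allocation. It is an illustration only and not Pepperdata's actual algorithm, which the article does not describe; the `Job` class, the `allocate_cpu` function, the job names and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int      # assumed convention: higher number = more important
    cpu_demand: float  # cores the job would use if left unconstrained


def allocate_cpu(jobs, capacity):
    """Grant CPU in descending priority order.

    Higher-priority jobs receive their full demand first; lower-priority
    jobs are throttled to whatever capacity remains.
    """
    remaining = capacity
    grants = {}
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        grant = min(job.cpu_demand, remaining)
        grants[job.name] = grant
        remaining -= grant
    return grants


# Hypothetical workload: total demand (28 cores) exceeds capacity (20 cores).
jobs = [
    Job("etl_backfill", priority=1, cpu_demand=12.0),
    Job("urgent_report", priority=3, cpu_demand=10.0),
    Job("adhoc_analytics", priority=2, cpu_demand=6.0),
]
grants = allocate_cpu(jobs, capacity=20.0)
for name, cores in grants.items():
    print(f"{name}: {cores} cores")
```

In this toy run the report and the analytics job receive everything they ask for, while the low-priority ETL job is cut back from 12 cores to the 4 that are left. A real controller would presumably re-evaluate continuously and account for memory and I/O as well as CPU.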
"We tried to figure out before how best to manage the workload -- you don't want to over- or underutilize your nodes," Clubb said. "Pepperdata utilizes the hardware efficiently. It dynamically adjusts to the work."
Using the software has led to savings on Chartboost's cloud computing bill, through a reduction in the number of cluster nodes the company requires. Clubb said its Hadoop system currently amounts to 22 nodes, down from 33 in the original deployment, a reduction of one-third.
Multiple Hadoop management options
Pepperdata is one of various vendors -- including traditional systems management software makers, Hadoop distribution providers and other startups such as Concurrent Inc. -- taking different approaches to automating Hadoop cluster management processes. The ability to self-adjust ongoing work based on processing priorities is a key element of Pepperdata's software, according to Chad Carson, a co-founder of the startup in Sunnyvale, Calif.
When companies put Hadoop big data services into production use, IT operations teams likely will need to guarantee application performance as part of service-level agreements, Carson said. That, he thinks, will make understanding how Hadoop is utilizing cluster resources -- and enabling usage of them to be modified on the fly -- increasingly important.
Newer Hadoop ecosystem members, such as the Apache Spark processing engine, can further exacerbate cluster management problems, Carson added. "Spark does more, more quickly," he said. "But it has big spikes in usage -- you end up seeing Spark jobs interfering with other jobs. Or you see a Spark or HBase workload with latency constraints, with which a lower-priority MapReduce [job], for example, can interfere."
Hadoop cluster management traffic cop
Software like Pepperdata's could help address some of the issues that continue to stymie enterprise adoption of Hadoop, said Mike Matchett, an analyst at Taneja Group in Hopkinton, Mass.
"It's one thing to do an application performance management system -- it's another to have a real-time controller that can optimize the system dynamically," Matchett said. "If you have one big cluster used for multiple things, it needs a traffic cop."
For Clubb, the next step is likely to be increased use of Spark to process Hadoop data. He said early work with a version of Pepperdata's software that supports the processing engine has had positive results for Chartboost. Clubb added that he's confident he can move more workflows to Spark without encountering the earlier cluster management problem that "more jobs meant it was more likely you'd run out of resources."