In big data development, trial and error are still the bywords, especially when it comes to tuning Apache Hadoop jobs to run in production. But lessons learned by teams at a social networking giant, embedded in new application profiling tools, are coming online to help move big data apps ahead in enterprises.
Due this week as an early-access release, Pepperdata's Application Profiler is based on Dr. Elephant, an Apache-licensed open source project launched last year by LinkedIn. Contributors to the Dr. Elephant project include Airbnb, Foursquare, Pepperdata and others.
The Dr. Elephant software parses through activity logs, applies rules for known-good deployments and provides a report on the health of Hadoop applications. That can be a helpful measure for teams trying to run multiple Hadoop jobs. Often, such teams have had little guidance and have needed deep levels of Java programming skills to determine the efficiency of their Hadoop clusters.
Since last year's open-sourcing, rules for Apache Spark deployments have been added to Dr. Elephant, according to Carl Steinbach, who is a senior staff software engineer at LinkedIn, based in Mountain View, Calif. Among other recent additions to Dr. Elephant is support for the Oozie and Airflow workflow schedulers. Enhancements to the Spark history fetcher are also now part of the package.
In effect, Dr. Elephant takes on the difficult task of reviewing job performance and suggesting tunings, according to Steinbach. With Dr. Elephant, "you don't have to learn how to tune a job. It tells you why changes need to be made," Steinbach said. "You don't have to be a Java mechanic on the side."
The ability of less-than-superstar programmers to tune Hadoop is important, based on Steinbach's experience. As the technical lead on LinkedIn's Hadoop development team, Steinbach has, along with his colleagues, had a role in bringing many Hadoop applications into production. They've sat through many code review meetings and have distilled the best practices that make up Dr. Elephant, so a wider range of programmers can succeed with Hadoop in production.
Finding fixes for sick jobs
Steinbach said he has seen a positive effect at LinkedIn as use of Dr. Elephant has expanded. That is because it presents individual job profiles in the context of other jobs running on big data clusters. Positive reinforcement is sometimes at work there, as teams see their own jobs' performance alongside that of others.
"You see not just your jobs, but other people's jobs, as well," Steinbach said. "And, if I see my job is 'sick,' it presents a kind of positive community pressure to do the right thing." Here, "doing the right thing" means fine-tuning job performance.
Dr. Elephant rules, or heuristics, in some ways resemble first-year medical-school texts, Steinbach said, where symptoms are observed and useful fixes are suggested.
For example, performance bottlenecks that may be uncovered by Dr. Elephant could include MapReduce processing skew issues or too much garbage-collection overhead. The response in the first case may be to repartition data, and, in the second case, the response may be to tweak garbage-collection settings in the Java Virtual Machine, he said.
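To make those two examples concrete, the following is a minimal Python sketch of what symptom-and-fix rules of this kind might look like. It is not Dr. Elephant's actual code: the `TaskStats` fields, the function names and the thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskStats:
    """Per-task counters. Field names are hypothetical, not Dr. Elephant's schema."""
    input_bytes: int
    cpu_ms: int
    gc_ms: int


def check_data_skew(tasks, max_ratio=2.0):
    """Flag skew when the largest task input exceeds max_ratio times the median.

    The 2.0 ratio is an assumed threshold for illustration.
    """
    if not tasks:
        return None
    sizes = [t.input_bytes for t in tasks]
    mid = median(sizes)
    if mid > 0 and max(sizes) / mid > max_ratio:
        return "data skew: consider repartitioning the input"
    return None


def check_gc_overhead(tasks, max_fraction=0.1):
    """Flag jobs that spend more than max_fraction of CPU time in garbage collection.

    The 0.1 fraction is an assumed threshold for illustration.
    """
    total_cpu = sum(t.cpu_ms for t in tasks)
    total_gc = sum(t.gc_ms for t in tasks)
    if total_cpu > 0 and total_gc / total_cpu > max_fraction:
        return "GC overhead: consider tuning JVM garbage-collection settings"
    return None


def diagnose(tasks):
    """Run every rule and return the list of findings (empty means healthy)."""
    findings = [check(tasks) for check in (check_data_skew, check_gc_overhead)]
    return [f for f in findings if f is not None]


if __name__ == "__main__":
    sick_job = [
        TaskStats(input_bytes=100, cpu_ms=1000, gc_ms=50),
        TaskStats(input_bytes=110, cpu_ms=1000, gc_ms=40),
        TaskStats(input_bytes=900, cpu_ms=5000, gc_ms=1200),
    ]
    for finding in diagnose(sick_job):
        print(finding)
```

Each rule in the sketch pairs an observed symptom with a suggested fix, mirroring the medical-text analogy; a real profiler would read its counters from job history logs rather than from hand-built records.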
Steinbach said that, since being open-sourced, Dr. Elephant has been applied to a larger list of big data workloads and has thus come to encompass a greater diversity of Hadoop and Spark job types. He noted that Pepperdata has contributed patches back to the open source project and is in a position to contribute significant new features.
Tracking cluster bottlenecks
Tools like Dr. Elephant help to advance the notion of DevOps, in which developers have a greater role in ensuring applications run well in operations. The tool does that without requiring a sit-down meeting between system operators and developers.
Application profiling tools that give developers a better view of what works in practice are key to DevOps for big data, according to Ash Munshi, CEO of Pepperdata, based in Cupertino, Calif. He said Pepperdata's Dr. Elephant-based Application Profiler is offered as a service, and it provides additional contextual data to better convey what happens on large banks of clusters as big data jobs are running. It joins a Pepperdata software suite that includes a cluster analyzer, capacity analyzer and a policy enforcer.
So far, many of the tools available for Hadoop performance tracking have been oriented toward system operators, according to Munshi, but he said he anticipates that the number of application profiling tools for developers will expand. He said general availability of the Application Profiler is expected in the second quarter of 2017.