
New tools offer a better view into managing Hadoop clusters

Running a Hadoop cluster in the data center isn't for the weak. But several new tools aim to give IT operations teams a closer look into what's going on inside Hadoop-based big data systems.

The breadth of the Hadoop ecosystem's components is one of its strengths, but that also feeds a weakness: Developers can find a streaming framework, columnar data store or other kind of architecture building block fit for almost any big data purpose, but making those pieces work together in applications can be challenging.

In different ways, new tools for managing Hadoop big data systems seek to address this issue. Included among them are Hadoop deployment automation tools from BlueData Software, open source diagnostic software developed by LinkedIn and Apache Ambari enhancements by Hortonworks that help to better visualize the health of operational Hadoop clusters.

The need for better management tools can be most pressing when Hadoop systems go into production use. That often means one-off Hadoop jobs are moved to centralized clusters to run along with other departments' jobs, typically with a variety of Hadoop components hosted on a single system. Setting up varied configurations, and assigning priorities in processing operations, can be daunting.

Startup BlueData is looking to ease those challenges with its EPIC software platform, short for Elastic Private Instant Clusters. The software supports the Apache Spark processing engine, as well as Hadoop distributions from Cloudera and Hortonworks. Moreover, the BlueData platform uses Docker container technology, which packages an application and its supporting software as a self-contained unit, to provide a multi-tenant approach to Hadoop deployment, according to a BlueData spokesperson.

Birth of the BlueData

BlueData continues to target user pain points, most recently with a March release that supports allocation of processing priorities for Hadoop jobs based on quality-of-service policies and adds quota enforcement capabilities for multi-tenant deployments.

"Hadoop has been kind of a pain to configure. I've had to jury-rig a lot of it," said Shannon Quinn, an assistant professor of computer science at the University of Georgia. Quinn works with students who need to create Hadoop environments as part of their research projects. His work as a principal researcher involves using Hadoop to support large-scale studies on computer vision and pattern recognition that can incorporate various data types, including Twitter data.

Quinn is using BlueData's EPIC platform as part of a proof-of-concept project with good results so far. "Now we can segment jobs out, and they all have their own virtual pools, where we can set priorities," he said.
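The idea of per-tenant pools with priorities and quotas can be illustrated with a minimal Python sketch. It is not BlueData's implementation; the `Tenant`, `Job` and `schedule` names, the core-based quota and the "lower number wins" priority rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    core_quota: int        # hypothetical: maximum cores the tenant may hold
    priority: int          # hypothetical: lower number is served first
    cores_in_use: int = 0


@dataclass
class Job:
    tenant: str
    cores: int


def schedule(tenants, jobs):
    """Return (admitted, rejected) jobs, honoring tenant priority and quota."""
    by_name = {t.name: t for t in tenants}
    admitted, rejected = [], []
    # sorted() is stable, so submission order is kept within each tenant.
    for job in sorted(jobs, key=lambda j: by_name[j.tenant].priority):
        tenant = by_name[job.tenant]
        if tenant.cores_in_use + job.cores <= tenant.core_quota:
            tenant.cores_in_use += job.cores
            admitted.append(job)
        else:
            # Quota enforcement: the job would push the tenant over its limit.
            rejected.append(job)
    return admitted, rejected
```

In this toy model a job that exceeds its tenant's quota is simply rejected; a real scheduler would more likely queue it until capacity frees up.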

A benefit he sees in using containers is that they offer a much lighter software stack. Quinn is able to work with BlueData's tools to spin up his own custom containers. While this type of work still requires tech savvy, he said he sees BlueData working to reduce the development skills required.

For Quinn, cost is one of the issues to address in going from POC to production. He judged the economics of BlueData as favorable compared with a comparable setup in the Amazon Web Services (AWS) cloud, but said he still needs to attract other researchers to share the cost of going into production. (BlueData EPIC Enterprise is priced at $500 per physical processor core annually, with volume discounts.)

Latency for the system compares favorably with running Elastic MapReduce, Amazon's Hadoop platform, on the AWS Elastic Compute Cloud, Quinn said. "Compared to AWS in general, it's faster. With AWS, it can be 'the luck of the draw' -- you can guarantee [performance for] a processing region, but that's about it."

Calling Dr. Elephant

With Hadoop, as with technologies that have come before it, the leap from development to managed operations can be painful. One problem is tuning jobs to ensure they don't compete with each other for resources on Hadoop clusters.

Data scientists and data engineers at LinkedIn often find themselves laboriously tracking down performance problems when jobs they've written begin to run regularly in production, according to Carl Steinbach, a senior staff software engineer and the technical lead on the company's Hadoop development team. Getting jobs to run effectively can be a daunting task, because Hadoop's many components -- think of Apache Pig, HBase, Spark, MapReduce and so on -- combine to make a jumble of virtual dials and knobs that need to be set correctly.

"Hadoop is powerful, but from a user perspective it can be a real mess," Steinbach said. "If you like knobs, it gives you more knobs than you would ever want."

For some time, LinkedIn has analyzed processing workflows in Hadoop and advised its developers on how to improve them. But as Hadoop found more uses, that became more difficult. So, LinkedIn has created performance monitoring and tuning tools that automate the process. In a nod to Hadoop's mascot, the monitoring software was dubbed Dr. Elephant.

A Hadoop cluster with a view

Having been "trained" on best practices for Hadoop deployment, Dr. Elephant observes processing activity and gives data scientists and others advice on how to tune their Hadoop jobs to play nice in the data center.

"It works in a way very close to the medical analogy," Steinbach said. "You go for a checkup. Blood samples go to a lab. If you find out you have high blood pressure, you're told you need to reduce salts and take medicine."

Dr. Elephant functions similarly, he said. Jobs run on clusters, creating logs and metrics. The software retrieves this data from a Hadoop cluster's YARN resource manager and runs heuristics on it to determine how well jobs are performing. The resulting information is made available to owners of Hadoop jobs via a visual dashboard.
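To make "runs heuristics on it" concrete, here is a minimal Python sketch of what one rule-based check over job metrics might look like. It is not taken from Dr. Elephant's code; the function name, the runtime-skew rule, the threshold and the advice strings are all hypothetical.

```python
from statistics import mean


def task_skew_heuristic(task_runtimes_sec, ratio_threshold=2.0):
    """Flag a job whose slowest task runs far longer than the average task.

    task_runtimes_sec: per-task runtimes, assumed to have been collected
    already from the cluster's job metrics.
    """
    if not task_runtimes_sec:
        return {"severity": "none", "advice": "no task data available"}

    average = mean(task_runtimes_sec)
    slowest = max(task_runtimes_sec)
    ratio = slowest / average if average > 0 else 0.0

    if ratio >= ratio_threshold:
        return {
            "severity": "high",
            "advice": "one task dominates runtime; check for skewed input",
        }
    return {"severity": "none", "advice": "task runtimes look balanced"}
```

A monitoring tool built along these lines would run many such checks per job and surface the results on a dashboard, so the job's owner sees which "knob" to adjust rather than raw logs.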

LinkedIn made the Dr. Elephant code available this month as an open source project under an Apache Version 2.0 license. Better Spark integration and visualizations of resource usage, as well as updates to the heuristics, are anticipated for future releases, Steinbach said.

Visualizing Hadoop management metrics

New dashboards and data visualizations are also being added to Ambari, an open source Hadoop management tool that's spearheaded by Hortonworks. An Ambari 2.2.2 release due out by month's end will include pre-built dashboards that give Hadoop systems administrators visualized views of resource usage across clusters, as well as metrics on overall cluster health.

The information provided by the dashboards should significantly expand the ability to monitor and manage large clusters via Ambari, according to Matthew Morgan, vice president of product and alliance marketing at Hortonworks.

Hortonworks also is working to integrate Atlas and Ranger, open source technologies that respectively provide data governance and security administration capabilities. The linkage, currently available as a technical preview, lets IT teams classify data by applying metadata tags in Atlas and then use Ranger to enforce data access policies based on the tags.
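The classify-then-enforce pattern can be sketched in a few lines of Python. This is a conceptual illustration only and does not use the Atlas or Ranger APIs; the `can_read` function, the tag names, the group names and the rule that untagged data is unrestricted are assumptions.

```python
def can_read(user_groups, asset_tags, tag_policies):
    """Return True if the user may read an asset carrying the given tags.

    tag_policies maps a metadata tag to the set of groups allowed to read
    data carrying that tag (hypothetical policy model).
    """
    for tag in asset_tags:
        allowed_groups = tag_policies.get(tag)
        if allowed_groups is None:
            # Assumption: a tag with no policy imposes no restriction.
            continue
        if not (set(user_groups) & allowed_groups):
            return False  # user belongs to none of the groups for this tag
    return True


# Example: a column tagged "PII" is readable only by the compliance group.
tag_policies = {"PII": {"compliance"}}
email_column_tags = {"PII"}
region_column_tags = set()
```

The appeal of the approach is that policies attach to tags rather than to individual tables or columns, so newly classified data picks up the right access rules without a new policy being written.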

Hortonworks, LinkedIn and BlueData aren't alone in their efforts to bring more clarity to the big data administrator's view on Hadoop clusters. Hortonworks rival Cloudera has rolled out an updated version of Cloudera Manager meant to provide better insight into Hadoop workload activity, and startup Pepperdata has created a Hadoop cluster manager that supports self-adjusting workloads.

Executive editor Craig Stedman contributed to this report.
