Data isn't the only thing that needs to be governed in big data systems. The queries run by data scientists and other users also have to be watched to make sure they don't bog down processing in Hadoop and Spark clusters.
Hadoop performance problems became an issue at BT Group PLC after use of its data lake environment started rising rapidly in early 2016 as production applications began proliferating. "We had a bow wave of demand from users," said Jason Perkins, head of business insight and analytics architecture at the London-based company.
Eventually, the communications and TV services provider had to "close the doors" to new users for a few months while it added more compute nodes to the Hadoop system, Perkins said. Properly balancing the "very mixed workload" of big data processing jobs remains a challenge, he added. And it could become a greater challenge -- BT plans to expand the number of applications in the cluster from about 100 as of April to 500 by year's end.
A fix for what ails Hadoop queries
LinkedIn Corp. ran into similar issues in its Hadoop and Spark environment, which has grown to more than 10,000 nodes across multiple clusters accessed by thousands of users. In particular, the company found that overall processing performance would suffer if individual jobs weren't tuned properly, said Carl Steinbach, a senior staff engineer at LinkedIn who heads its Hadoop development team.
At first, the Hadoop team tried to avoid such problems by meeting with users to review proposed queries and suggest changes. But that could take weeks -- and then the users had to "get back in the queue for another meeting," Steinbach said. "It wasted a lot of time for both them and my team."
To accelerate the process, LinkedIn developed a tool called Dr. Elephant that monitors Hadoop performance and identifies problematic big data queries. The web-based tool runs on its own cluster node, continually analyzing system logs to find "sick jobs" and then offering advice on how to "treat" them, Steinbach explained.
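The idea of scanning job statistics for "sick jobs" and attaching advice can be illustrated with a small rule-based check. The Python sketch below is only an illustration of that general approach, not Dr. Elephant's actual code: the `JobStats` fields, the `diagnose` function and both thresholds are hypothetical and were chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobStats:
    """Hypothetical per-job summary, as might be extracted from cluster logs."""
    job_id: str
    task_runtimes_sec: List[float]
    requested_memory_mb: int
    peak_memory_mb: int


def diagnose(job: JobStats) -> List[str]:
    """Return tuning advice for a job; an empty list means no rule fired."""
    advice = []

    # Rule 1 (assumed threshold): the slowest task runs far longer than a typical one.
    runtimes = sorted(job.task_runtimes_sec)
    if runtimes:
        median = runtimes[len(runtimes) // 2]
        if median > 0 and runtimes[-1] > 4 * median:
            advice.append("task skew: repartition the input so tasks get similar amounts of data")

    # Rule 2 (assumed threshold): the job uses less than half the memory it asked for.
    if job.requested_memory_mb > 0 and job.peak_memory_mb < 0.5 * job.requested_memory_mb:
        advice.append("memory over-requested: lower the per-task memory setting")

    return advice


if __name__ == "__main__":
    sick = JobStats("job_001", [10, 12, 11, 100], requested_memory_mb=4096, peak_memory_mb=1000)
    for line in diagnose(sick):
        print(f"{sick.job_id}: {line}")
```

A real monitoring tool would draw on many more signals than these two, but the pattern of measuring, comparing against a threshold and reporting a suggested fix is the same.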
The Mountain View, Calif., company, now owned by Microsoft, began using Dr. Elephant in 2015 and open sourced it last year. In tracking queries, Dr. Elephant provides "sort of a soft-governance model," Steinbach said. "It does shine a light on what's happening in the cluster. Everyone has a view of what everyone else is doing, and that motivates people to do the right thing."
Software vendor Pepperdata this year added a product based on Dr. Elephant to its set of tools for managing Hadoop clusters and governing their use. A variety of other commercial and open source cluster management tools are also available, both from big data platform vendors such as Cloudera and Hortonworks and from third-party software developers like Pepperdata.
Proper balance keeps costs in check
Balancing workloads to maximize Hadoop performance in an affordable way is also a concern for Marc Gallman, senior manager of big data at PC maker Lenovo Group Ltd. Gallman, who works at Lenovo's U.S. headquarters in Morrisville, N.C., oversees a big data architecture that combines an on-premises Hadoop cluster with systems running in the Amazon Web Services cloud.
Lenovo is looking to move most of its big data processing work to the cloud so it can analyze marketing and internet clickstream data closer to real time than is possible with the batch jobs now run in the on-premises cluster, Gallman said. The goal, he added, is to enable the company to do more targeted marketing while spending its advertising budget more efficiently and effectively.
But to avoid ballooning what his team spends on data processing, Gallman said it may make sense to continue running some analytical algorithms and queries in batch mode internally. "Not every algorithm needs to run in real time," he noted. "It won't be cost-effective or beneficial to just drive everything through that pipeline."