Choosing storage for streaming large files in big data sets
Hadoop shows a lot of promise as a relatively inexpensive landing place for the streams of big data coursing through organizations. The open source technology provides a distributed framework, built around highly scalable clusters of commodity servers, for processing, storing and managing data that fuels advanced analytics applications. But there's no such thing as a free lunch: In production use, achieving high levels of Hadoop performance can be a challenge.
Despite all the attention it's getting, Hadoop is still a relatively young technology -- it only reached Version 1.0 status in December 2011. As a result, much of the work being done with Hadoop by users remains somewhat experimental in nature, especially outside of the large Internet companies that helped to create it and that are replete with Java programmers and systems administrators versed in deploying the technology.
In addition, the core combination of the Hadoop Distributed File System (HDFS) and MapReduce programming model has been joined by a continually expanding ecosystem of additional components. For example, there's Flume, which speeds the collection and ingestion of log data into HDFS; Sqoop, a tool for moving data between Hadoop and structured databases; the Oozie job scheduler; and a multitude of other supporting technologies. In many big data systems, NoSQL databases of varying stripes -- key-value stores, graph databases, document-oriented databases, others -- are also in the mix.
Many of those added tools have potential performance ramifications for Hadoop systems; piecing them together so they work well in tandem with one another can be a feat. System latency, memory settings, I/O bandwidth, job parallelization and Java Virtual Machine (JVM) settings are some of the other performance variables that IT and application development teams need to take into account. Otherwise, the best-laid data architecture plans could be stymied when companies try to put Hadoop into use.
Because of the effort it can take to build well-designed Hadoop clusters and applications that won't bog down, there often are advantages to using commercial distributions of the software instead of the basic Apache Hadoop open source download, said Dmitriy Kalyada, a research and development engineer at software development services provider Altoros Systems Inc. in Sunnyvale, Calif. The commercial distributions also offer tools for more effectively monitoring the performance of the nodes in a cluster, he added.
Twists and turns on the Hadoop performance path
While adding more nodes is sometimes a straightforward path to improving Hadoop system performance, that isn't always the case. Kalyada said you might expand a four-node Hadoop cluster to 20 nodes and not get a comparable five-fold increase in performance. In such cases, data architects and developers may have to go back to the drawing board and tweak their designs to get the desired throughput.
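One common way to reason about that shortfall is Amdahl's law, which caps the speedup from added nodes according to how much of a job cannot be parallelized. Kalyada did not cite this model; it is used here only as an illustration, and the 5% serial fraction below is an assumed figure, not a measurement from any real cluster.

```python
def amdahl_speedup(nodes: int, serial_fraction: float) -> float:
    """Speedup over a single node when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)


def relative_speedup(old_nodes: int, new_nodes: int, serial_fraction: float) -> float:
    """How much faster the job runs after growing the cluster from old_nodes to new_nodes."""
    return amdahl_speedup(new_nodes, serial_fraction) / amdahl_speedup(old_nodes, serial_fraction)


if __name__ == "__main__":
    # Assumed serial fractions, chosen for illustration only.
    for serial in (0.0, 0.05, 0.20):
        gain = relative_speedup(4, 20, serial)
        print(f"serial fraction {serial:.0%}: 4 -> 20 nodes gives {gain:.2f}x")
```

Under this simplified model, a perfectly parallel job gets the full five-fold gain, while a job that is 5% serial gets only about 2.9 times faster. Real clusters add further costs the model ignores, such as shuffle traffic and scheduling overhead.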
Getting data into and out of Hadoop at a quick clip can also be an issue because of its batch processing orientation, although the Hadoop community is actively brewing up solutions to the problem of batch latency. For example, several vendors have developed tools that support ad hoc analysis of Hadoop data through SQL queries or mainstream business intelligence tools. Also, a Hadoop 2.0 release that became available for beta testing in August makes MapReduce an optional component and lets users run other kinds of workloads, including real- or near-real-time processing jobs.
Meanwhile, some organizations are using complex event-processing engines to goose their Hadoop performance. Even Yahoo Inc., the company where Hadoop was first hatched, ran into problems with the technology, which had trouble keeping up with incoming information about the activities of online users that Yahoo wanted to correlate with its inventory of available ads.
Yahoo collects data on "billions" of such user events daily, according to Bruno Fernandez-Ruiz, a senior fellow and vice president for platforms at the company. But the data-collection process doesn't stop to accommodate Hadoop. "The problem with MapReduce computing is the batch window," he said. "You have events coming in while running a batch job that is going to run two or three hours."
Knitting together a performance booster
In response, Yahoo implemented Storm-on-YARN, an application that pairs Hadoop 2.0's new resource-management software with Storm, an open source complex event-processing tool. Fernandez-Ruiz, who spoke at the Hadoop Summit 2013 in June, said that while MapReduce still handles long-running jobs, Storm processes low-latency events that can be added at the end of the batch run in order to provide a more complete view of website user activity.
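The general idea of topping up a batch result with events that arrived during the batch window can be sketched in a few lines. This is not Yahoo's implementation and uses neither Storm nor MapReduce APIs; the event shape, the function names and the per-user count are all hypothetical, chosen only to show how the two views combine.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical event shape: (timestamp_seconds, user_id)
Event = Tuple[int, str]


def batch_view(events: Iterable[Event], cutoff: int) -> Counter:
    """Counts per user for events that arrived before the batch job's cutoff."""
    return Counter(user for ts, user in events if ts < cutoff)


def speed_view(events: Iterable[Event], cutoff: int) -> Counter:
    """Counts per user for events that arrived at or after the cutoff, while the batch ran."""
    return Counter(user for ts, user in events if ts >= cutoff)


def merged_view(batch: Counter, speed: Counter) -> Counter:
    """Adds the low-latency counts to the batch result for a more complete picture."""
    return batch + speed


if __name__ == "__main__":
    events = [(10, "a"), (20, "b"), (130, "a"), (140, "c")]
    cutoff = 100  # the batch job only saw events before this timestamp
    print(merged_view(batch_view(events, cutoff), speed_view(events, cutoff)))
```

In a real deployment the two views would come from separate systems; the sketch only shows why the merged result is more complete than the batch output alone.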
Hadoop users "are dealing with an extremely deep protocol stack," said Dominique Heger, owner and chief technology officer at DHTechnologies, an IT services company in Dripping Springs, Texas. And because the various tools communicate with one another through Java application programming interfaces, he continued, there are ample opportunities for Java coding practices to affect performance. "You have to do a good study first to see how the Hadoop components will interact if you're going to get the performance you're looking for," Heger said.
Another way to enable good performance, Heger said, is to hew to the Hadoop mantra of moving computation to the data -- i.e., distributing processing tasks to server nodes that are close to where the relevant data is stored in a cluster. But, he added, that's the opposite of the approach most developers and database administrators are familiar with: moving data to an application for processing.
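A toy scheduler makes the contrast concrete. The sketch below is not Hadoop's actual scheduling code; the node names, block IDs and slot counts are invented for illustration. It prefers a node that already stores a replica of the data block and falls back to a remote node only when no local slot is free.

```python
from typing import Dict, List, Optional, Tuple


def pick_node(replicas: List[str], free_slots: Dict[str, int]) -> Optional[str]:
    """Prefer a node that holds a replica of the block; otherwise take any free node."""
    for node in replicas:
        if free_slots.get(node, 0) > 0:
            return node  # computation moves to the data
    for node, slots in free_slots.items():
        if slots > 0:
            return node  # data must move to the computation
    return None  # no capacity anywhere


def schedule(blocks: Dict[str, List[str]],
             free_slots: Dict[str, int]) -> Dict[str, Tuple[Optional[str], bool]]:
    """Maps each block ID to (assigned node, whether the assignment is data-local)."""
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    plan: Dict[str, Tuple[Optional[str], bool]] = {}
    for block_id, replicas in blocks.items():
        node = pick_node(replicas, slots)
        if node is None:
            plan[block_id] = (None, False)
            continue
        slots[node] -= 1
        plan[block_id] = (node, node in replicas)
    return plan


if __name__ == "__main__":
    # Hypothetical cluster: which nodes store each block, and free task slots per node.
    blocks = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n3"]}
    free = {"n1": 1, "n2": 1, "n3": 0}
    for block, (node, local) in schedule(blocks, free).items():
        print(block, node, "local" if local else "remote or unscheduled")
```

Every assignment flagged as not local implies copying a block across the network before processing can start, which is the cost the mantra is meant to avoid.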
For many companies, there's still a lot to learn about Hadoop -- and a need to tread carefully. Hadoop and other big data technologies are "a minefield" for the uninitiated, said Kent Collins, a database engineer and architect at BNSF Railway Corp. in Fort Worth, Texas. But with time, experience and effective project management practices, the promise of fast Hadoop performance can come to pass.