Managing Hadoop projects: What you need to know to succeed
The Hadoop distributed processing framework presents IT, data management and analytics teams with new opportunities for processing, storing and using data, particularly in big data applications. But it also confronts them with new challenges as they look to deploy and work with Hadoop systems. And because Hadoop and the large number of open source technologies surrounding it are evolving quickly, organizations must be prepared for frequent updates and changes -- most immediately in the form of the newly minted Hadoop 2 release.
Hadoop 2, which the Apache Software Foundation made generally available on Oct. 15, 2013, will eventually take the framework far beyond its current core configuration, which combines the Hadoop Distributed File System (HDFS) with Java-based MapReduce programs. Early-adopter companies are using that pairing to help them deal with large amounts of transaction data as well as various types of unstructured and semi-structured data, including server and network log files, sensor data, social media feeds, text documents and image files.
Hadoop typically runs on clusters of commodity servers, resulting in relatively low data processing and storage costs. And because of its ability to handle data "with very light structure, Hadoop applications can take advantage of new information sources that don't lend themselves to traditional databases," said Tony Cosentino, vice president and research director at Ventana Research in San Mateo, Calif.
But Cosentino added in an email that implementations of the existing Hadoop architecture are restricted by its batch-processing orientation, which makes it more akin to a truck than a sports car on performance. "Hadoop is ideally suited where time latency is not an issue and where significant amounts of data need to be processed," he said.
In its HDFS-MapReduce configuration, "Hadoop is very good at analysis of very large, static unstructured data sets consisting of many terabytes or even petabytes of information," said William Bain, CEO of ScaleOut Software Inc., a vendor of data-grid software in Beaverton, Ore. As an example, he cited a sentiment analysis application on "a huge chunk of Twitter data" aimed at discerning what customers are thinking -- and tweeting -- about a company or its products.
Like Cosentino, Bain emphasized that, because of its batch nature and "large startup overhead" on processing jobs, Hadoop generally hasn't been useful in real-time analysis of live data sets -- although that could change, thanks to the combination of Hadoop 2 and new query engines recently introduced by some vendors looking to support ad hoc analysis of Hadoop data.
Data warehouse doors open for Hadoop
Data warehousing applications involving large volumes of data are good targets for Hadoop uses, according to Sanjay Sharma, a principal architect at software development services provider Impetus Technologies Inc. in Los Gatos, Calif. How large? It varies, he said: "Tens of terabytes is a sweet spot for Hadoop, but if there is great complexity to the unstructured data, it could be tens of gigabytes."
Some users, such as car-shopping information provider Edmunds.com Inc., have deployed Hadoop and related technologies to replace their traditional data warehouses. But Hadoop clusters often are being positioned as landing pads and staging areas for the data gushing into organizations. In such cases, data can be pared down by MapReduce, transformed into or summarized in a relational structure and moved along to an enterprise data warehouse or data marts for analysis by business users and analytics professionals. That approach also provides increased flexibility: The raw data can be kept in a Hadoop system and modeled for analysis as needed, using extract, load and transform processes.
Sharma describes such implementations as a "data lake for downstream processing." Colin White, president of consultancy BI Research in Ashland, Ore., uses the term "business refinery." In a report released in February 2013, Gartner Inc. analysts Mark Beyer and Ted Friedman wrote that using Hadoop to collect and prepare data for analysis in a data warehouse was the most-cited strategy for supporting big data analytics applications in a survey conducted by the research and consulting company. An even 50% of the 272 respondents said their organizations planned to do so during the next 12 months.
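To make the staging-area pattern concrete, here is a minimal sketch of how a MapReduce-style job might pare raw log data down into summary rows that fit a relational table. It is plain Python that imitates the map, shuffle and reduce steps in memory; it does not use the Hadoop API. The log format, the sample lines and every function name are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical raw web-server log lines in the form "timestamp status path".
RAW_LOG = [
    "2013-10-15T08:00:01 200 /inventory/sedans",
    "2013-10-15T08:00:02 404 /inventory/coupes",
    "2013-10-15T08:00:03 200 /inventory/sedans",
    "malformed line",
    "2013-10-16T09:30:00 200 /inventory/trucks",
]


def map_phase(line):
    """Emit ((day, path), 1) for each successful request; drop everything else."""
    parts = line.split()
    if len(parts) != 3:
        return  # pare down: skip lines that do not match the assumed format
    timestamp, status, path = parts
    if status != "200":
        return  # pare down: skip failed requests
    yield (timestamp[:10], path), 1


def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse one key's values into a single relational-style row."""
    day, path = key
    return (day, path, sum(values))


def summarize(lines):
    """Run map, shuffle and reduce, returning rows sorted by day and path."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return sorted(reduce_phase(key, values) for key, values in grouped.items())


rows = summarize(RAW_LOG)
for row in rows:
    print(row)
```

Each resulting tuple (day, path, hit count) could be loaded as a row into a data warehouse or data mart table, while the raw lines stay behind in the Hadoop system for later remodeling. In a real cluster the map and reduce functions would run in parallel across many nodes rather than in a single process.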
The vibrancy of the open source ecosystem that surrounds Hadoop can hardly be overstated.
From its earliest days, Hadoop has attracted software developers looking to create add-on tools to fill in gaps in its functionality. For example, there are HBase, Hive and Pig -- respectively, a distributed database, a SQL-style data warehouse and a high-level language for developing data analysis programs in MapReduce. Other supporting actors that have become Hadoop subprojects or Apache projects in their own right include Ambari, for provisioning, managing and monitoring Hadoop clusters; Cassandra, a NoSQL database; and ZooKeeper, which maintains configuration data and synchronizes distributed operations across clusters.
YARN spins new flexibility in Hadoop 2
And now Hadoop 2 -- originally known as Hadoop 2.0 -- is entering the picture. Central to the update is YARN, an overhauled resource manager that enables applications other than MapReduce programs to work with HDFS. By doing so, YARN (a good-natured acronym for Yet Another Resource Negotiator) is meant to free Hadoop from its reliance on batch processing while still providing backward compatibility with existing application programming interfaces.
"YARN is the key difference for Hadoop 2.0," Cosentino said. Instead of letting a MapReduce job see itself "as the only tenant on HDFS," he added, "it allows for multiple workloads to run concurrently." One early example comes from Yahoo, which has implemented the Storm complex event processing software on top of YARN to aid in funneling data about the activities of website users into a Hadoop cluster.
Hadoop 2 also is due to bring high availability improvements, through a new feature that enables users to create a federated name (or master) node architecture in HDFS instead of relying on a single node to control an entire cluster. In addition, it adds support for running Hadoop on Windows. Meanwhile, commercial vendors are brewing up additional management-tool elixirs -- new job schedulers and cluster provisioning software, for example -- in an effort to further boost Hadoop's enterprise readiness.