Managing Hadoop projects: What you need to know to succeed
As commercial software lines go through revision after revision, it can be interesting to look at what each "rev" adds. That's true with Apache Hadoop-based products, which are morphing at a fever pitch. The eventual path to expanded Hadoop use cases in the enterprise will be defined by such additions.
The changes aren't window dressing. The updates point to issues that people are finding when they apply Hadoop in their organizations. Hadoop has gained interest because it creates parallel clusters, based on commodity servers, that provide low-cost processing and storage for unstructured data, log files and other forms of big data. That data is now piling up in companies, sometimes overwhelming data warehouses or not meshing with them. But more work is needed for Hadoop to fit into most IT shops.
Hadoop performance can be creaky at times. The technology can be hard to program and is by no means crash-proof. Like any relatively new software, it doesn't come with a lot of tools for managing its operations. To top off a quick list of Hadoop limitations: It isn't an out-of-the-box fit for analytics applications, which are becoming perhaps the major use case for Hadoop data.
A slew of Hadoop fixes takes the form of a "Hadoop ecosystem" of add-on modules with names like Flume, Pig, Hive, ZooKeeper and Oozie. There's also HBase, a column-oriented NoSQL database meant to work with Hadoop.
Getting to HBase
With its release of MapR M7 this month, MapR Technologies Inc. is looking to remove some of the obstacles in Hadoop's path to the enterprise. MapR's founders include a member of an early Google MapReduce search software team, giving the San Jose, Calif., company a historical view on Hadoop and its limits. Now, for analytics undertakings, MapR M7 brings HBase into MapR's Hadoop distribution, but it does so in a unique way.
HBase deployments scale by spreading database table regions across the servers in a cluster. Some users have found that this approach can cause problems with performance, availability and database mirroring, according to Jack Norris, MapR's vice president of marketing. MapR worked to address those restrictions, he said. The company had already rewritten its underlying Hadoop layer for improved performance and reliability. That work paved the way for a more integrated Hadoop-HBase implementation that eliminates two layers of Java virtual machines (JVMs) that can slow performance, Norris said.
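To make the idea of spreading table regions across servers more concrete, here is a toy sketch in Python. It assumes each region owns a contiguous, sorted range of row keys and is assigned to one server. The start keys, server names and function names are invented for illustration; this is not HBase's or MapR's actual client code.

```python
import bisect

# Hypothetical region layout: each region owns the half-open row-key range
# [start_key, next_start_key). The first region starts at the empty key.
REGION_START_KEYS = ["", "g", "n", "t"]
# Hypothetical assignment of each region to a server in the cluster.
REGION_SERVERS = ["server-a", "server-b", "server-c", "server-a"]


def locate_region(row_key: str) -> int:
    """Return the index of the region whose key range contains row_key."""
    # bisect_right counts the start keys that are <= row_key; the last of
    # those start keys belongs to the region that owns the row.
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


def server_for(row_key: str) -> str:
    """Return the (hypothetical) server that hosts the row's region."""
    return REGION_SERVERS[locate_region(row_key)]


if __name__ == "__main__":
    for key in ["apple", "grape", "zebra"]:
        print(key, "->", server_for(key))
```

In a layout like this, every read or write for a given row lands on a single server, which is one reason the placement of regions can affect both performance and availability.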
Meanwhile, with the release of its Developer Suite and Developer Sandbox tools in March, Continuuity Inc. in Palo Alto, Calif., is trying to address developer skills issues related to Hadoop. Its founders include early Hadoop and HBase hands from Yahoo and Facebook. Their experience tells them that development teams will need higher-level application programming interfaces (APIs) and practical code libraries to take Hadoop and HBase mainstream. The tools work with Continuuity's App Fabric cloud platform.
"At Yahoo we found it difficult to build apps on Hadoop," Continuuity CEO Todd Papaioannou said. "It's low-level infrastructure. It was difficult even to just get data in and out." With Continuuity's software, teams still have to program in Java, but the APIs allow a wider group of Java developers to become big data application developers, he maintained.
"Right now, to be a Hadoop developer, you really have to be a distributed system expert," Papaioannou said. In the legendary days of Hadoop's creation, there were plenty of such experts working at Yahoo and Google and looking for ways to solve the problems of search. But these skills are less common in the typical enterprise. Like other vendors, Continuuity sees its task as efficiently connecting and interfacing with components of the Hadoop ecosystem, but in its case with a special effort to improve developer productivity.
Without such an approach, Papaioannou said, Hadoop development resembles the early PC days of the Homebrew Computer Club, where techies mixed and matched motherboards, chips and other parts.
More twists ahead on the Hadoop path
Hadoop was originally designed to run Web searches in batches across distributed systems, but much more was quickly envisioned. The technology still has a long way to go to make good on some of those visions, suggests Wayne Eckerson, director of the BI Leadership Research unit at TechTarget Inc., which is also the parent company of SearchDataManagement.com.
Eckerson points out analytical Hadoop obstacles and opportunities in a recent blog post on TechTarget's BeyeNetwork site. Companies today, he writes, use Hadoop most often as "a gigantic extraction and transformation engine." That's good, perhaps, but not the future some have touted.
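For readers who have not seen the extraction-and-transformation pattern up close, the following Python sketch imitates the map, shuffle and reduce phases of a MapReduce-style job on a single machine. The log lines, field layout and function names are made up for illustration, and nothing here uses Hadoop's real APIs. A real job would run the same phases in parallel across a cluster.

```python
from collections import defaultdict

# Made-up log lines, for illustration only.
RAW_LOGS = [
    "2013-05-01 GET /index.html 200",
    "2013-05-01 GET /missing 404",
    "2013-05-02 POST /login 200",
    "malformed line",
]


def map_phase(line):
    """Extract: emit (status_code, 1) for each well-formed line."""
    fields = line.split()
    if len(fields) == 4 and fields[3].isdigit():
        yield fields[3], 1


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Transform: collapse each group of values into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(RAW_LOGS))
```

The output of such a job is typically a compact, structured summary that can then be loaded into a data warehouse or another analytical system.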
A big step forward in creating new enterprise Hadoop use cases, Eckerson says, is to enable users to analyze the data directly in Hadoop systems with tried-and-true SQL-based tools. Early adopters are asking their vendors to make such capabilities a priority, writes Eckerson, who sees new SQL query engines -- for example, Cloudera Inc.'s Impala and Greenplum's Hawq -- becoming a growing part of the Hadoop product mix. He cautions, however, that it's too soon to pass judgment on their ability to effectively support real-time querying in Hadoop.
Surely, the Hadoop way is unfolding, and more product visions and versions can be expected up ahead.
Jack Vaughan is SearchDataManagement.com's news and site editor.