Hadoop and related software tools have been popping up all over the map, and just last week, two potentially significant commercial distributions of the open source software came forward. These Hadoop distributions come from Intel Corp. and EMC Corp. Both are big industry presences, and their Hadoop offerings will get attention. But veterans of past software wars may be led to ask, "How many Hadoop 'distros' are too many?"
Only a day after EMC showcased its Pivotal HD Apache Hadoop distribution, Intel announced the global availability of Intel Distribution for Apache Hadoop software. EMC boldly touted its package as "the world's most powerful Hadoop distribution." In only slightly more muted tones, Intel described its version of Hadoop as "built from the silicon up to deliver industry-leading performance and improved security features." In many ways, the Hadoop space is becoming a crowded field.
"It's a bit like the Unix wars of 15 years ago," said Philip Howard, research director for U.K. IT analyst firm Bloor, recalling the time when many Unix versions roamed the data center. Unix begat Linux, which now revolves around a smaller handful of popular distributions. "Eventually, people will choose just two or three distributions," he said, referring to Hadoop. "That's just the way markets always work."
Hadoop distributions portfolio full
Coming out of work done at Yahoo, the Hadoop Distributed File System sits at the center of most Hadoop distributions. Yahoo initiated an Apache Foundation project to make Hadoop a citizen of the open source world.
But Hadoop isn't wholly isolated. There are a lot of other pieces a company can choose from when it's filling out a Hadoop portfolio. These might include Pig procedural programming tools, Hive querying tools, the HBase NoSQL database and many others. In fact, Hadoop distributions are usually a mix of open source and proprietary bits.
Key efforts that have carried Hadoop forward include those of Cloudera, an early Hadoop startup and provider of the Cloudera Distribution; and Hortonworks, a 2011 Yahoo spin-off that offers the Hortonworks Data Platform. Both companies are deeply involved in the Apache Hadoop open source project, but they are far from alone in the Hadoop hunt. The Apache Foundation lists more than 20 software packages as true Apache Hadoop or "a derivative work thereof" on its website.
Hadoop from the chip on up
Santa Clara, Calif.-based chipmaker Intel positions its Hadoop undertaking as a major software effort. Research began in 2009. Since then, Intel has begun to sell its Hadoop packages in China and elsewhere, and now is taking its Hadoop distribution global.
The company is working from its strengths, optimizing Xeon processor hardware, networking and I/O functions to work with Hadoop. At launch time, Intel representatives claimed that, running on Xeon, the company's Hadoop distribution can reduce the time it takes to analyze 1 TB of data from four hours to seven minutes.
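To put the vendor's claim in perspective, the short sketch below works out the implied speedup and throughput. It uses only the figures quoted above; the assumption that 1 TB means a decimal terabyte (10^12 bytes) and the helper name `throughput_mb_per_s` are ours, not Intel's, and the result is simple arithmetic rather than a benchmark.

```python
# Back-of-the-envelope check of the quoted claim: 1 TB analyzed in
# four hours before, seven minutes after. Assumes a decimal terabyte.
DATA_BYTES = 1 * 10**12
BEFORE_SECONDS = 4 * 60 * 60   # four hours
AFTER_SECONDS = 7 * 60         # seven minutes


def throughput_mb_per_s(data_bytes, seconds):
    """Average throughput in megabytes per second (hypothetical helper)."""
    return data_bytes / seconds / 10**6


speedup = BEFORE_SECONDS / AFTER_SECONDS
before = throughput_mb_per_s(DATA_BYTES, BEFORE_SECONDS)
after = throughput_mb_per_s(DATA_BYTES, AFTER_SECONDS)

print(f"Implied speedup: about {speedup:.0f}x")
print(f"Implied throughput: about {before:.0f} MB/s before, {after:.0f} MB/s after")
```

In other words, the claim amounts to a speedup of roughly 34 times, which is worth keeping in mind when comparing it with figures from other distributions measured on different hardware or workloads.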
Silicon-level encryption is also supported. While various elements of Intel's Hadoop distribution are open source, the management software included in the stack is proprietary.
Hopkinton, Mass.-based EMC's Hadoop work appears to be part of a large company reorganization. EMC is putting its cloud, data warehousing and development software assets together under the "Pivotal" moniker, with former VMware chief Paul Maritz at the head.
More detail on the Pivotal effort is expected later this month, but the company has disclosed that it relies heavily on a long-running EMC Greenplum engineering project, code-named HAWQ, which natively integrates its pipelined PostgreSQL database into Hadoop. Across the industry, the link to SQL is often discussed when Hadoop is the topic these days.
By integrating the Greenplum database, Pivotal HD opens up big data analytics to savvy business users, according to Jeff Kelly, an analyst with the Wikibon Project. But, he cautioned, "the database is EMC's proprietary database. If you are going to go with EMC's distribution, obviously it is going to be optimized to work with the Greenplum database and not an outside database." In the open source world, this is often called "forking."
The additional distributions arise as Hadoop begins to move to a new level of maturity, in the form of version 2.0. This too creates challenges for data management leaders sorting through Hadoop options. Still in alpha, and considered less stable than 1.0, Hadoop 2.0 adds high availability via a standby NameNode that can take over on failover. Some players have begun to include Hadoop 2.0 elements in their main distributions, while others have held off.
Hadoop 2.0 improves security with new file encryption abilities. Among the associated enhancements is YARN (for "Yet Another Resource Negotiator"), a reworking of the MapReduce framework that provides both a global resource manager and a new per-application master that negotiates resources for each job.
Clearly, keeping track of Hadoop and "the Hadoop ecosystem" will be an ongoing effort. For Hadoop, "the more, the merrier" is an old adage that, for now, seems to be sticking.