This content is part of the Essential Guide: Managing Hadoop projects: What you need to know to succeed

EMC, Intel unveil new Hadoop distributions, but how many is too many?

IT heavyweights EMC and Intel have entered the big data ring with new distributions of Hadoop, adding more contenders to an already crowded field.

Hadoop and related software tools have been popping up all over the map, and just last week, two potentially significant commercial distributions of the open source software came forward. These Hadoop distributions come from Intel Corp. and EMC Corp. Both are big industry presences, and their Hadoop offerings will get attention. But veterans of past software wars may be led to ask, "How many Hadoop 'distros' are too many?"

Only a day after EMC showcased its Pivotal HD Apache Hadoop distribution, Intel announced the global availability of its Intel Distribution for Apache Hadoop software. EMC boldly touted its package as "the world's most powerful Hadoop distribution." In only slightly more muted tones, Intel described its version of Hadoop as "built from the silicon up to deliver industry-leading performance and improved security features." By any measure, the Hadoop space is becoming a crowded field.


"It's a bit like the Unix wars of 15 years ago," said Philip Howard, research director for U.K. IT analyst firm Bloor, recalling the time when many Unix versions roamed the data center. Unix begat Linux, which now revolves around a smaller handful of popular distributions. "Eventually, people will choose just two or three distributions," he said, referring to Hadoop. "That's just the way markets always work."

Hadoop distributions portfolio full

Hadoop grew out of work done at Yahoo, and the Hadoop Distributed File System (HDFS) sits at the center of most Hadoop distributions. Yahoo initiated an Apache Software Foundation project to make Hadoop a citizen of the open source world.

But Hadoop isn't wholly isolated. There are a lot of other pieces a company can choose from when it's filling out a Hadoop portfolio. These might include the Pig procedural programming tools, the Hive querying tools, the HBase NoSQL database and many others. In fact, Hadoop distributions usually tend to be a mix of open source and proprietary bits.
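Whatever the mix of tools, most work in these distributions ultimately runs as MapReduce jobs over data in HDFS. As a rough illustration, not tied to any vendor's distribution, here is the classic word-count job sketched in plain Python, with the map and reduce phases written the way a Hadoop Streaming mapper and reducer would behave (on a real cluster, the shuffle between them is handled by Hadoop itself):

```python
# Illustrative sketch only: both phases run locally here, whereas a real
# Hadoop job distributes them across a cluster.
from collections import defaultdict


def map_phase(lines):
    # Like a streaming mapper: emit a (word, 1) pair for every word seen.
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reduce_phase(pairs):
    # Like a streaming reducer after the shuffle: sum the counts per key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


data = ["hadoop at yahoo", "hadoop everywhere"]
print(reduce_phase(map_phase(data)))
```

Tools such as Pig and Hive exist largely so users can express this kind of aggregation declaratively instead of writing the map and reduce functions by hand.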

Key efforts that have carried Hadoop forward include those of Cloudera, an early Hadoop startup and provider of the Cloudera Distribution; and Hortonworks, a 2011 Yahoo spin-off that offers the Hortonworks Data Platform. Both companies are deeply involved in the Apache Hadoop open source project, but they are far from alone in the Hadoop hunt. The Apache Software Foundation lists more than 20 software packages as true Apache Hadoop or "a derivative work thereof" on its website.

Hadoop from the chip on up

Santa Clara, Calif.-based chipmaker Intel positions its Hadoop undertaking as a major software effort. Research began in 2009. Since then, Intel has begun to sell its Hadoop packages in China and elsewhere, and now is taking its Hadoop distribution global.

The company is working from its strengths, optimizing Xeon processor, networking and I/O functions to work with Hadoop. At launch, Intel representatives claimed that, running on Xeon, its Hadoop distribution can cut the time needed to analyze 1 TB of data from four hours to seven minutes.

Silicon-level encryption is also supported. While various elements of Intel's Hadoop distribution are open source, the management software included in the stack is proprietary.

Hopkinton, Mass.-based EMC's Hadoop work appears to be part of a large company reorganization. EMC is putting its cloud, data warehousing and development software assets together under the "Pivotal" moniker, with former VMware chief Paul Maritz at the head.


More detail on the Pivotal effort is expected later this month, but the company has disclosed that it relies heavily on a long-running EMC Greenplum engineering project, code-named HAWQ, which natively integrates its pipelined PostgreSQL-based database into Hadoop. Across the industry, SQL access is a recurring theme in Hadoop discussions these days.

By integrating the Greenplum database, Pivotal HD opens up big data analytics to savvy business users, according to Jeff Kelly, an analyst with The Wikibon Project. But, he cautioned, "the database is EMC's proprietary database. If you are going to go with EMC's distribution, obviously it is going to be optimized to work with the Greenplum database and not an outside database." In the open source world, such divergence is often called "forking."
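The appeal Kelly describes is that SQL-on-Hadoop engines such as HAWQ (or Hive and its kin) let a business user express an aggregation as a single query rather than a hand-written MapReduce job. As a purely local stand-in, sqlite3 is used below in place of such an engine; the table and data are invented for illustration:

```python
# sqlite3 here is only a self-contained substitute for a SQL-on-Hadoop
# engine; the point is the one-query shape of the analysis, not the engine.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [("hadoop",), ("at",), ("yahoo",), ("hadoop",)])

# One declarative query replaces the mapper, shuffle and reducer.
rows = conn.execute(
    "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY 2 DESC"
).fetchall()
print(rows)  # ('hadoop', 2) sorts first
```

The trade-off Kelly raises still applies: queries tuned for one vendor's engine may not carry over cleanly to another distribution.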

The additional distributions arrive as Hadoop moves toward a new level of maturity in the form of version 2.0. This, too, creates challenges for data management leaders sorting through Hadoop options. Still in alpha, and considered less stable than 1.0, Hadoop 2.0 adds high availability via a standby NameNode that can take over if the active NameNode fails. Some players have begun to include Hadoop 2.0 elements in their main distributions, while others have held off.
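Enabling that high-availability setup means telling HDFS about both NameNodes in its configuration. A minimal sketch of the relevant hdfs-site.xml properties follows; the "mycluster" nameservice ID, the nn1/nn2 labels and the hostnames are placeholder values, not anything mandated by a particular distribution:

```xml
<!-- Sketch of HDFS NameNode HA settings; all values are placeholders. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
```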

Hadoop 2.0 also improves security with new file encryption abilities. Among the associated enhancements is YARN (for "Yet Another Resource Negotiator"), a reworking of the MapReduce layer that splits job handling between a global resource manager and a per-application master negotiator.

Clearly, keeping track of Hadoop and "the Hadoop ecosystem" will be an ongoing effort. For Hadoop, "the more, the merrier" is an old adage that -- for now -- seems to be sticking.

Follow on Twitter: @sDataManagement.

Dig Deeper on Hadoop framework

Join the conversation



Are you concerned about the number of Hadoop distributions?
The best is to have consistency among a few "prestige" distributions...
It's the quality that needs to be examined, not the quantity.
It is a shame some of these companies are stealing code without committing any back to Apache.
The whole thing seems to be in a big mess.
I guess the "commodity" hardware aspect of Hadoop's original goals is quickly diminishing. EMC isn't in the game because they want to sell more cheap servers or storage units; it's just about joining the bandwagon.
Users will be confused, and it will take them more time to choose and decide.
The market can't absorb too many players.
All the Hadoop distributions are basically the same with one exception, MapR.
I agree with Philip Howard that time and the market will whittle them down, but history shows the winners are not always the best technology.
My time spent in the software industry suggests to me that all the hype we see about Hadoop is a necessary part of the process. Like competition in biological systems, time will determine the surviving species, a.k.a. "flavours."
More at the beginning, fewer later. A normal process for a new "tool."
Until we embrace a flavour, I'll stand by. When it is time to take a dip, I guess we will go with the best in the industry at that time!
Time will tell if they will help toward improvement. Hadoop distributions are hardware-specific, in my view.