Strata + Hadoop World 2016: Hadoop and Spark in spotlight
Hadoop distribution provider Cloudera said it will increase its participation in Apache Software Foundation open source initiatives. The move follows some competitors' maneuverings to establish Hadoop interoperability standards outside the multiyear Apache Hadoop standards undertaking.
Cloudera's principals include some of the creators of the original Hadoop, and as a result it is closely watched. Its move to sponsor Apache, the group that oversees Hadoop ecosystem standards, came shortly after Pivotal, Hortonworks and other Hadoop players allied to form an Open Data Platform (ODP) initiative to create "a tested reference core" of Hadoop ecosystem components.
With IBM, SAS, Altiscale, Teradata and others enlisted, the ODP initiative gained attention. But without notable Hadoop forces such as Cloudera and MapR Technologies on board, the initiative -- intended to foster compatibility and interoperability -- raised the specter of both new de facto standardization and new controversy.
"We were invited to join the ODP foundation, but we decided not to," said Amr Awadallah, Cloudera CTO and co-founder, in a keynote at last week's Strata + Hadoop World 2015 conference in San Jose, Calif. He said the consortium for formulating Hadoop interoperability already exists. "It's Apache," he said.
Open source seen in RFPs
Word of ODP's formation came at the same time that Pivotal announced it will open source elements of its HAWQ, GemFire and Greenplum database technologies. In turn, Hortonworks will work with Pivotal on joint engineering of Apache Hadoop and YARN.
Leaders of Pivotal, an EMC-VMware spin-off that began to ply the Hadoop market two years ago, affirmed open source software's importance in data management today. "We are starting to see open source coming up in RFP requirements," Sunny Madra, vice president of the company's Data Product Group, said at a pre-Strata online press conference.
Madra and Pivotal colleagues suggested that more than the Apache Hadoop mark of version stability may be needed to meet the requirements of some businesses. Thus, an ODP goal is to ease the enterprise's entry into open source software by forging a common set of Hadoop ecosystem components. Today, the versioning of such components can be difficult to manage.
But with some Hadoop players on the sidelines, ODP's goals may be difficult to achieve, some industry watchers suggest. They note that other software technologies, notably Unix and Linux, have gone through standards-related growing pains along the way.
"This is an example of a very typical pattern that we have seen over the years. Vendor-driven de facto standards form, but the membership of the group is not necessarily universal," said Ovum analyst Tony Baer.
There is value in a group that goes beyond Apache's mandate to develop releases and declare release stability while not necessarily certifying the software, Baer said. But he questioned the ODP group's potential relevance given its present composition, which doesn't include Cloudera or MapR.
ODP's quest for greater interoperability in Hadoop ecosystem components is controversial, too, at least in the view of a Hadoop user questioned at Strata.
"I don’t see interoperability problems. My personal opinion on why this is happening is you have a lot of vendors here selling pretty similar things," said Tomas Mazukna, an enterprise architect at RelayHealth, a medical claims processing system operator in Alpharetta, Ga., that has been using Cloudera's Hadoop tools for more than a year.
If his team transitioned from one Hadoop distribution to another, the biggest issue would be moving many terabytes of data into a new cluster, Mazukna said, rather than moving from one version of a Hadoop ecosystem component to another.
On the dynamics of ODP
Ron Bodkin, founder and president of Think Big, a consulting services company that's now part of Teradata, acknowledged that we're in "early innings" on the ODP initiative and that not having the likes of Cloudera and MapR involved "creates its own dynamic" for the effort. But Teradata is one of the charter members of the initiative, and Bodkin said he hopes ODP results in "less needless duplication and fewer interfaces to develop to."
Bodkin also said he thinks user organizations could get more input into Hadoop's development through ODP than they currently have via Apache. "It's not like users are being super well taken care of by the Apache process," he said, pointing to the significant amount of vendor participation in Apache Hadoop efforts. "Apache projects are primarily driven by commercial vendors. It's not some pure governance model."
Some users at the Strata conference were adopting a wait-and-see attitude toward ODP while seeking clarification on its objectives and how it will work. For example, Michael Brown, CTO at Reston, Va.-based comScore Inc., said he had reached out to Pivotal to try to get more information about the ODP initiative and "what it fills from a gap perspective compared to what Apache does."
ComScore -- which analyzes Internet data for online marketers, advertisers and publishers -- uses a combination of Pivotal's Greenplum database and MapR's Hadoop distribution at the heart of its analytics architecture.
Apache "does a great job of shepherding" Hadoop, Brown said, adding that he thinks there is adequate portability now between different Hadoop and MapReduce implementations. Creating roadblocks to changing from one distribution to another "is the last thing I want in the ecosystem," he said.
Includes reporting by executive editor Craig Stedman.