Hortonworks Inc. is changing the way it updates its Hadoop distribution, creating separate release streams for Hadoop's core components and related open source technologies surrounding the distributed processing framework.
Starting with version 2.4 of the Hortonworks Data Platform (HDP) distribution, which became available yesterday, the company said it's trying to better balance somewhat conflicting user demands for stability and innovation by releasing software updates at two different tempos.
That means a more moderate, yearly update plan for the Hadoop Distributed File System (HDFS) and other core Hadoop elements, while new releases of "extended services," such as Hive, HBase, Ambari and the fast-rising Spark data processing engine, will be issued more frequently.
At a press conference held in San Francisco and streamed live via the Web, Hortonworks also discussed updates to its real-time data streaming platform, Hortonworks DataFlow (HDF), as well as new joint development work with Hewlett Packard Enterprise (HPE) to improve Spark's performance in analytics applications that can take advantage of large amounts of shared memory.
In a follow-up interview, Tim Hall, vice president of product management at Hortonworks, based in Santa Clara, Calif., said the company had heard about the need for a more balanced product-release strategy from customers, as well as resellers and other business partners that work with HDP.
Revised software update 'cadence'
Until now, Hortonworks typically updated the version of Apache Hadoop used in HDP whenever a new one came out. But there were two HDP revisions last year, and Hall said some users weren't thrilled about having to upgrade the foundations of their Hadoop clusters that often. "They've got to have that stability," he noted.
Under the new approach, Hall said, Hortonworks will stick with a single version of the core Hadoop software through each major release cycle of HDP. For example, HDP 2.4 is based on Apache Hadoop 2.7.1, unchanged from the preceding release, and the same will apply to future 2.x point releases of the Hadoop distribution. The company won't switch to another version of Hadoop for HDFS and components such as the YARN resource manager and MapReduce batch-processing environment until it comes out with HDP 3.x, according to Hall.
Another approach will be taken with faster-evolving open source frameworks and tools that work with Hadoop. Hortonworks said new versions of Spark and other technologies will be grouped together, and "released continually throughout the year." The latest HDP release includes Spark 1.6, plus updates of Ambari, which provides Hadoop cluster management capabilities, and Hortonworks' own SmartSense preventive maintenance software.
Hall, who also wrote a blog post about the new approach, clarified in the interview that the company still expects a relatively small number of releases on the second track. "I can't imagine we'll do that more than four times a year," he said. "That would be a lot."
Taking time to vet things properly
"These announcements seem to parallel our needs," said Mike Peterson, vice president of platforms and data architecture at Neustar Inc., an information and analytics services provider based in Sterling, Va. Peterson, who took part in the press conference, and who has worked with Hortonworks Hadoop for more than four years, said in an interview that running behind the bleeding edge can be OK.
"We're already one release behind on HDP," said Peterson, whose group runs a Hadoop cluster with 500-plus nodes. "We want the core HDP platform to be mature. New features have to be vetted out. We don't mind if some things are in a little slower pattern, because we have to make sure we do this in a controlled way."
He contrasted that with Neustar's attitude toward HDF and Spark, newer types of software for moving streams of data between different systems and accelerating data processing and analytics applications. "They're more experimental, and we're happy with that," he said.
On the flip side, Doug Henschen, an analyst at Constellation Research Inc., said Hortonworks' two-pronged release strategy will allow users to "opt in" to gain faster access to cutting-edge capabilities.
In the past, Henschen said, Hortonworks has been slower than Hadoop distribution rival Cloudera to make some new technologies available to customers. "I see this as a way to accelerate the availability of newer capabilities that are in demand."
Spark speedup plans get sorted out
Hortonworks also said it was working with Hewlett Packard Labs, part of HPE, to improve the performance of Spark for large-scale computing. Work includes enhancements to Spark's underlying "shuffle engine" technology for faster in-memory computation.
The work is intended to broaden the range of challenges that Spark can address, said Martin Fink, executive vice president and CTO at HPE. Fink is also a Hortonworks board member, a position stemming from a 2014 investment in the company by HPE predecessor Hewlett-Packard Co.
Fink said during the press conference that Hortonworks and HPE plan to contribute this work to the Apache Spark community. The shuffle engine is one of the elements that developers at Spark-specialist Databricks have been working on as part of their Project Tungsten update to Spark, which has produced new features that have been built into the last few releases of the big data engine.
Hortonworks positions HDP 2.4 and HDF 1.2 as "connected data platforms." Taken together, the two platforms are intended to address two distinct classes of big data applications, involving data at rest and data in motion, according to Herb Cunitz, the vendor's president and chief operating officer.
The ability to handle both types of applications, which also can be framed as batch versus near-real-time data processing, is sometimes described as Lambda architecture. "We're seeing a lot of Lambda architecture helpers this year," said Henschen, who includes Hortonworks and competitors such as Cloudera and MapR in the movement.
Executive Editor Craig Stedman contributed to this story.