Users balance Spark support from vendors, access to new features

Apache Spark users are often faced with a quandary: continue with vendor support or break out on their own to a newer version of the fast-moving open source software with updated features?

Novantas Inc. gets the Spark data processing engine as part of a commercial Hadoop distribution. But that ties the analytics services and software provider to whichever Spark release is supported in the version of the Hadoop bundle it runs from vendor Cloudera Inc. As a result, it doesn't necessarily get immediate access to new Spark functionality.

For example, Novantas used Spark 1.4 in a Hadoop-based application it developed early this year to help analytics teams at banks find relevant customer and financial data in internal systems. That release, issued by the Apache Software Foundation in June 2015, had been superseded by Spark 1.5 last September and 1.6 in January. But to upgrade, Novantas would have had to give up Spark support through Cloudera -- and that wasn't something it was keen on doing.

"We don't want to use an unsupported version," said Kaushik Deka, CTO and director of engineering for the New York-based company's Novantas Solutions technology unit. That's especially true because the application was the group's first real foray into using Hadoop and Spark. Sticking with Spark 1.4 wasn't ideal, though. Deka said some aspects of it were kludgy, adding that he hopes those issues will be resolved when Novantas does move to a newer release of the technology.

Such considerations are common in big data environments, which typically involve open source technologies that are updated at a rapid pace. The situation is particularly acute with Spark: It got a total of 18 releases through Apache between July 2014 and July 2016, when a Spark 2.0 version became generally available.

To keep from falling behind on new features, some organizations have eschewed vendor-provided Spark support and deployed the base Apache Spark software on their own.

For example, Synchronoss Technologies Inc. sourced Spark from Hadoop vendor MapR Technologies in 2014 -- initially through Razorsight Corp., which Synchronoss bought a year later. But Suren Nathan, senior director of big data analytics platforms at the Bridgewater, N.J., mobility management company, said Synchronoss has sometimes upgraded directly to a new release of Apache Spark to get desired functionality. "By now, my team is pretty proficient with Spark," he said.

Webtrends Inc. also has been using the base Spark software. "We're trying to stay as current as possible on releases," said Peter Crossley, CTO at the online activity tracking company in Portland, Ore. "There's nothing in the market that's moving as fast as this technology."

Ultimately, though, Crossley said he would prefer to get a supported version of Spark through Hortonworks Inc., his Hadoop vendor. To try to make that feasible, he said, his team worked with Hortonworks on a two-track release plan that the vendor adopted last March to speed up its delivery of Spark and several other big data technologies associated with Hadoop.
