The Spark Summit in San Francisco this week put the continued development of the popular general-purpose analytics engine on display, as Spark originator Databricks detailed updates in the works for Spark 2.0. Companies including IBM and Microsoft were also on hand to fuel the fire with Spark-related offerings.
Apache Spark 2.0 will be made generally available later this month, according to Matei Zaharia, CTO at Databricks, based in San Francisco. While he admitted the software was still not stable, a caution the company also attached to the preview edition disclosed last month, he described 2.0 as "a good way to try things out."
Zaharia, who created Spark as a grad student at the University of California, Berkeley, and forewent an MIT position to co-found Databricks, said Spark 2.0 includes over 2,000 patches from 280 contributors. Release elements he highlighted included coding technology that speeds Spark development and performance, SQL 2003 support and what he called structured streaming.
Higher reliance on code generation -- Databricks describes the new version of Spark's underlying engine as an intelligent compiler for Spark -- along with higher-level APIs will open up Spark to a larger group of developers, Zaharia said. That could be useful for beleaguered IT managers, as Spark skills continue to reap a high premium.
Zaharia pointed to a recent Stack Overflow survey with 13,540 U.S. respondents, which listed Spark as a top-paying technology skill for developers in the country. In the survey, Spark skills tied with Scala, a sister technology in that Spark is written in the Scala language, coming in at a $125,000 average per year.
In his conference keynote, Zaharia good-naturedly advised the mostly developer audience to apprise their managers of the survey results, while assuring managers present that Databricks was working to address the skills issue.
Make mine Lambda
Simplifying sometimes arduous data streaming development also represents a path to wider use of Spark.
With this Spark update, the software's committers have opted for a general-purpose approach to streaming, at least in part to ease the transition for programmers not used to streaming. Spark maintains a mini-batch approach to streaming, something that has attracted some criticism from advocates of competing approaches.
Streaming has often been cited as a favorable point for Spark, but it has faced competition from specialized streaming projects, such as Storm, Flink and Heron, some of which may offer lower latency than Spark.
Spark 2.0's structured streaming provides a single set of APIs that can be placed in the category of Lambda architecture, a term that has been used for designs that combine both batch and streaming.
This supports "the most common use of streaming that we have seen," Zaharia said. "Most of the users we see integrate the different processing modes." He added that the new Spark Streaming APIs also bear greater resemblance to Spark's SQL APIs, again with the intent to make the Spark framework more accessible to a wider developer community.
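To make the idea of one set of logic serving both batch and streaming more concrete, here is a minimal sketch in plain Python. It does not use Spark and is not Spark's API; the function names, the sample events and the batch size are all hypothetical, chosen only to illustrate how a mini-batch approach can reuse batch code on a stream.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def count_by_key(records: Iterable[str]) -> Counter:
    """Shared logic: count how often each key occurs in a set of records."""
    return Counter(records)


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a (possibly unbounded) stream into small fixed-size batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def streaming_counts(stream: Iterable[str], batch_size: int) -> Iterator[Counter]:
    """Apply the batch logic to each mini-batch and yield the running totals."""
    running: Counter = Counter()
    for batch in micro_batches(stream, batch_size):
        running.update(count_by_key(batch))
        yield Counter(running)  # snapshot of the totals so far


# Hypothetical sample data, used for both processing modes.
events = ["click", "view", "click", "buy", "view", "click"]

batch_result = count_by_key(events)                           # batch mode
stream_results = list(streaming_counts(events, batch_size=2))  # streaming mode

# Once the stream has been fully consumed, both modes agree.
print(stream_results[-1] == batch_result)
```

The point of the sketch is that `count_by_key` is written once and reused in both modes; the streaming path only adds the batching and the running state. The trade-off is that results arrive no faster than one mini-batch at a time, which is the latency criticism noted above.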
While he cautioned that wide industry use of data streaming is not yet in the offing, independent analyst and industry observer Thomas Dinsmore said Databricks' move to mix batch and real-time programming models will find a receptive audience.
"Spark structured streaming is an attempt to integrate streaming into a broader environment. And the fact is that no one ever derived insight from just a stream," he remarked.
"Typically, the stream is combined with historical trends data," he said. That is especially the case in important real-world applications, like credit card fraud detection, he added.
Dinsmore said the collection of elements in the latest version of Spark can position it successfully, even versus alternatives that are faster in terms of latency.
"Spark doesn't really have to be the best at everything. It just has to be good at everything," he said.
Spark fever spreads
Meanwhile, Apache Spark continues to appear as part of other vendors' offerings. The providers range from mainline traditional providers to startups.
At the summit, Microsoft formally rolled out the Spark for Azure HDInsight platform, which it built together with Hortonworks. Microsoft also made generally available R Server for HDInsight versions that run on the cloud or on premises. The company obtained R Server along with its 2015 purchase of R-language specialist Revolution Analytics.
One of Microsoft's goals is to broaden the ranks of Spark developers by improving R support on the distributed Spark platform. Another is to speed computation.
"By combining R with Spark, we give data scientists the familiarity of R, while allowing them to run their code with the scalability of Spark," said Oliver Chiu, senior product marketing manager at Microsoft. While noting performance can vary according to workload, he said R Server on Spark can enable faster training of machine learning models.
"Microsoft taking its R Server products and rewiring them for Spark is a good move," Dinsmore said. "They have created a high-performance platform."
R integrations have struggled somewhat when applied to Hadoop due to slow performance of its MapReduce component, and the move to support Spark should help here, according to Dinsmore, who formerly served as director of product management for Revolution Analytics. This mirrors a widely held view that sees Spark as a performance improvement over Hadoop's original data processing engine, MapReduce.
Are you data science-experienced?
Also this week, IBM announced a development environment for Apache Spark. Running on the IBM Bluemix cloud platform, it targets the needs of data scientists, especially those who work with the R programming language. Known as the Data Science Experience, the collaborative service will center on SparkR, Spark SQL and Spark ML tool sets.
The R audience needs more tools in order for the Spark developer ranks to broaden, according to Rod Thomas, vice president of product development for IBM's analytics group.
"We want to make it easier for data scientists to build R models, and then to run them on Spark," he said. R is a significant tool in the data scientist community today, but its use alongside Spark has been challenging, he noted. "So far, R has not been treated as a first-class citizen in Spark," Thomas said.
In other Spark Summit news, longtime Hadoop proponent MapR Technologies released a version of its Converged Data Platform that is Spark-only. That is, the package uses YARN, but otherwise strips off other Hadoop ecosystem components. In addition, NoSQL software house Redis Labs said it had produced a connector that integrates its Redis Cloud with Databricks' Spark service.