The Apache Spark distributed data processing framework is being prepped for another step forward.
Details of the 2.0 version of the software, disclosed at Spark Summit East 2016 in New York, indicated that the next open source Spark revision will include improvements to Spark Streaming, the technology's stream processing module.
Data streaming has gained increasing interest of late, as ever-larger amounts of Web and mobile data arrive in organizations, and more applications focus on keeping that big data in motion.
Also on tap with Spark 2.0 are some merged APIs and a general boost in the core Spark system's performance. Updates to its SQL query capabilities and machine learning APIs are due, too.
But the streaming updates hold particular importance for an emerging class of distributed processing that is based on what is described as a Lambda architecture. In Lambda schemes, offline batch processing pipelines coexist with real-time processing pipelines for data analytics.
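To make the Lambda idea concrete, here is a minimal sketch in plain Python rather than Spark. It assumes a page-view counting workload, and the names `batch_view`, `SpeedLayer` and `query` are hypothetical, chosen only for illustration. A batch layer recomputes counts from the full history, a speed layer counts events the batch job has not yet absorbed, and a query merges the two.

```python
from collections import Counter


def batch_view(events):
    """Batch layer: recompute counts from the full stored history."""
    return Counter(event["page"] for event in events)


class SpeedLayer:
    """Speed layer: incrementally count events newer than the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def ingest(self, event):
        self.counts[event["page"]] += 1


def query(batch, speed, page):
    """Serving step: merge the older batch view with the real-time delta."""
    return batch[page] + speed.counts[page]


# Hypothetical data: three historical events, then two that arrive afterward.
history = [{"page": "home"}, {"page": "cart"}, {"page": "home"}]
batch = batch_view(history)

speed = SpeedLayer()
for event in [{"page": "home"}, {"page": "checkout"}]:
    speed.ingest(event)

total_home = query(batch, speed, "home")
```

The cost of this design is that the same counting logic lives in two places, which is the duplication that a unified batch-and-streaming engine would aim to remove.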
"Analysts don't want to wait 24 to 48 hours to work on data sets any longer," said Seshu Adunuthula, head of analytics infrastructure at eBay Inc., based in San Jose, Calif. "If you look at the people building out new platforms, say the Ubers and the Spotifys of the world, they are not building traditional batch pipelines at all."
He said eBay's extract, transform and load (ETL) jobs were benefiting from the use of the core Spark engine. Adunuthula said he expected to research the new features of Spark Streaming, along with a Google streaming data alternative, known as Dataflow. He said eBay has done some data streaming with Apache Storm, a stream processing engine often bundled with Hadoop 2.0 distributions that has been overshadowed as Spark has hurtled ahead.
Like Adunuthula, Suren Nathan, senior director of big data analytics at Synchronoss, based in Bridgewater, N.J., said he sees data streaming and Lambda architecture as central to next-generation analytics development. He said the work he oversees is currently focused on version 1.6 of Spark -- and his group's Spark Streaming work is "in the lab," but is expected to roll into production later this year. He said his team will be looking closely at Spark Streaming improvements.
"Streaming is the way to go," Nathan said. "People want to reduce time to action. It typically used to be 'next week' or 'next day.' Today, it is 'now.'"
What's on tap in Spark 2.0
As major Hadoop distribution providers added it to their portfolios, Spark has gained considerable attention as an alternative to MapReduce, Hadoop's original data processing engine for big data analytics. The shift gained momentum late last year when IBM pledged to use Spark in a wide range of analytical products.
Spark's popularity rests on its speed of analytics, machine learning libraries, SQL support and streaming. Updates to all of these features are due with the next version.
Spark 2.0 builds on lessons learned in the past two years, according to Matei Zaharia, who created Spark as part of his academic work at the University of California at Berkeley, and who is CTO at summit sponsor Databricks, based in San Francisco. Although startup and traditional vendors alike have endorsed Spark, Databricks developers, such as Zaharia, remain the principal movers behind open source Apache Spark.
With the release of version 2.0, due in April or May 2016, a high-level API attached to a Spark SQL engine will make it easier to develop applications that depend on event timing, Zaharia said. He called the overall approach "structured streaming," noting it was designed to support both batch and real-time approaches. Streaming in the first release of version 2.0 will focus on applications that use ETL jobs, and will be accompanied by updates to Spark's present machine learning APIs.
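The idea of one query serving both batch and real-time data can be sketched without Spark at all. The plain-Python example below is not the structured streaming API. It assumes a 60-second window, and the names `window_start` and `update` are hypothetical. Records are bucketed by the time the event happened rather than the time it arrived, and the same function is applied once to a whole data set and then repeatedly to small chunks.

```python
from collections import Counter

WINDOW_SECONDS = 60  # assumed window size, for illustration only


def window_start(event_time):
    """Return the start of the window that contains this event time."""
    return event_time - (event_time % WINDOW_SECONDS)


def update(state, records):
    """Fold records into per-window counts keyed by event time."""
    for event_time, _payload in records:
        state[window_start(event_time)] += 1
    return state


# (event_time_seconds, payload); the third record arrives out of order.
records = [(5, "a"), (65, "b"), (50, "c"), (130, "d")]

# Batch: process the whole data set in one call.
batch_result = update(Counter(), records)

# Streaming: apply the same function to chunks as they arrive.
stream_result = Counter()
for i in range(0, len(records), 2):
    update(stream_result, records[i:i + 2])
```

Because the late record is counted in the window where it occurred, the chunked run ends with the same per-window counts as the batch run.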
More parallel processing in store
Peter Crossley, director of product architecture at Webtrends, based in Portland, Ore., welcomes improvements to Spark Streaming capabilities. At Webtrends, where open source Spark software is used for Web data analytics, as much as 500 TB of data is being added each quarter to its big data cluster, making it a hotbed of data streaming.
Crossley noted that, at its core, the Spark Streaming architecture is still batch-oriented -- albeit in microbatches. Today, data architects work with Spark Streaming by defining "windows" of data to work on, so only a subset of the arriving data is handled at any one time. Handling data streams as subsets can be difficult, and Crossley is looking for improvements in Spark to help in this regard.
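The microbatch-and-window model Crossley describes can be illustrated with a short plain-Python sketch. It does not use Spark. The one-second batch interval, the three-batch window and the function names are assumptions made for the example. Arriving records are grouped into fixed-interval microbatches, and each result covers only the most recent few batches.

```python
from collections import deque

BATCH_INTERVAL = 1  # seconds per microbatch (assumed)
WINDOW_BATCHES = 3  # window length, in microbatches (assumed)


def to_microbatches(records, interval=BATCH_INTERVAL):
    """Group (arrival_time, value) records into consecutive microbatches."""
    buckets = {}
    for arrival_time, value in records:
        buckets.setdefault(int(arrival_time // interval), []).append(value)
    if not buckets:
        return []
    # Intervals with no arrivals still produce an (empty) microbatch.
    return [buckets.get(i, []) for i in range(min(buckets), max(buckets) + 1)]


def windowed_sums(batches, window=WINDOW_BATCHES):
    """After each microbatch, sum the values in the last `window` batches."""
    recent = deque(maxlen=window)  # the oldest batch drops out automatically
    sums = []
    for batch in batches:
        recent.append(batch)
        sums.append(sum(sum(b) for b in recent))
    return sums


# Hypothetical (arrival_time_seconds, value) records.
records = [(0.2, 1), (0.7, 2), (1.1, 3), (2.5, 4), (3.9, 5)]
batches = to_microbatches(records)
sums = windowed_sums(batches)
```

Each result reflects only a subset of the data seen so far, and nothing is reported until a microbatch closes, which is why views of incoming data lag behind arrival.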
"We have to look ahead. With the next version of Spark, we will see more parallel processing," he said. The next version of Spark Streaming will start to provide views of incoming data that will be more immediate than those in present configurations, he suggested.
Streaming beyond batch jobs
Like others, Crossley suggested the era of the so-called Lambda architecture may give way to one in which real-time processing becomes more prevalent than batch. "We are getting closer to a situation where we won't have to do batch [processing] anymore," he said.
"There are a number of interesting tricks up the sleeve for Spark 2.0. Structured streaming is certainly one," said Tony Baer, a principal analyst at London-based Ovum. "The upcoming release further entrenches Spark as the compute engine of choice for integrating real time with batch processes."
Baer said other Spark 2.0 enhancements, such as Tungsten, an analytical engine rewrite that employs code generation to help address CPU bottlenecks, should also help to push Spark ahead. Meanwhile, he added, the general movement toward flexible open source analytics should be driven forward as well by other emerging technologies, such as Arrow, which is aiming to create a common in-memory columnar data format, and Alluxio, an endeavor formerly known as Tachyon that will offer in-memory data persistence across different systems.