Innovation in Spark Streaming architecture continued apace last week as Spark originator Databricks discussed an upcoming add-on expected to reduce streaming latency.
Based on ongoing work by a lab at the University of California, Berkeley, elements of what is being called the Drizzle framework are expected to become part of Apache Spark later this year, according to the company.
The anticipated streaming update is part of Databricks' larger efforts to provide a platform for broad new analytics uses. Drizzle is intended to help promote users' moves to so-called Lambda architectures that combine batch and real-time data processing approaches.
Spark trending now at Netflix
The move to embrace both batch and real-time processing isn't an easy one, even for fast-flying web companies. But it is a natural step, according to Shriya Arora, a senior data engineer at Netflix.
Arora is part of a Netflix team that employs Spark processing and streaming to transform and push data to data scientists who develop algorithms that personalize the company's movie recommendations to subscribers. As Netflix converts some applications from batch to real time, she's working to fine-tune Spark Streaming to ensure there are monitoring alerts that warn when streaming jobs may fail.
"Streaming is better than having long-running jobs, but it comes at a cost. For example, streaming failures have to be addressed immediately. If an application is down too long, you run into data loss," she told an audience at last week's Spark Summit East 2017 in Boston.
Real time means 'why wait?'
The real-time effort is worthwhile, however, because it can better align Netflix's movie recommendations with the immediate interests of customers. "Trending now" viewing choices, for example, can be more completely up to date, Arora said. "Why wait 24 hours when you can pick up the new information in an hour?"
But the Spark Streaming architecture today doesn't support pure event streaming -- it still has roots in a "micro-batching" formula that rapidly processes small batches of data. So, there are cases where time-sensitive applications might better opt for streaming as supported by alternative frameworks such as Flink or Storm, Arora said.
Such use cases are a prime target for Drizzle, a project within the UC Berkeley RISELab -- itself a descendant of the AMPLab project that begat Apache Spark. [RISE stands for Real-time Intelligence with Secure Execution.]
Drizzle's goal is to unify record-at-a-time streaming with micro-batch models. It is in part an answer to Flink, an emerging streaming architecture that has shown performance benefits over present Spark Streaming.
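The latency gap between the two models can be illustrated with a small sketch. This is a toy simulation, not Spark or Flink code: the function names, the arrival times and the one-second batch interval are all hypothetical, and processing cost is ignored. It assumes a micro-batch engine handles each event only when the window containing it closes, while a record-at-a-time engine handles each event on arrival.

```python
def micro_batch_latencies(arrival_times_ms, batch_interval_ms):
    """Time each event waits until the end of the batch window it falls in."""
    latencies = []
    for t in arrival_times_ms:
        # Window k covers [k * interval, (k + 1) * interval) and is
        # processed at its closing boundary (an assumption of this sketch).
        boundary = (t // batch_interval_ms + 1) * batch_interval_ms
        latencies.append(boundary - t)
    return latencies


def record_at_a_time_latencies(arrival_times_ms, per_record_cost_ms=0):
    """Each event is handled on arrival, so it waits only the per-record cost."""
    return [per_record_cost_ms for _ in arrival_times_ms]


# Hypothetical event arrival times, in milliseconds.
arrivals = [100, 400, 900, 1200, 1950]

batch_wait = micro_batch_latencies(arrivals, batch_interval_ms=1000)
record_wait = record_at_a_time_latencies(arrivals)

print("micro-batch waits (ms):     ", batch_wait)
print("record-at-a-time waits (ms):", record_wait)
```

In this simplified model, an event can wait up to a full batch interval before it is processed, which is why shrinking or sidestepping the batch boundary matters for time-sensitive applications.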
Hearing Flink steps?
As he discussed Drizzle in a Spark Summit keynote, Ion Stoica didn't try to cover up the Spark Streaming architecture's present latency shortcomings compared with Apache Flink. He said Drizzle is intended to reduce Spark Streaming's latency by about a factor of 10.
Stoica is executive chairman and a co-founder of Databricks, and is also a professor of computer science at UC Berkeley and a part of the RISELab. In graphs, he showed Spark trailing Apache Flink by hundreds of milliseconds in handling event throughput.
He also showed data in which early versions of Drizzle and a companion Drizzle-Opt execution engine slightly improve upon present Apache Flink performance. While details were sparse, Drizzle architecture as depicted on the RISELab's website is meant to "decouple execution granularity from coordination granularity" for workloads on clusters.
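One way to read "decouple execution granularity from coordination granularity" is that work still executes in small micro-batches while the scheduler coordinates once per group of micro-batches rather than once per batch. The sketch below is a back-of-the-envelope model under that assumption, not Drizzle's implementation: the function names, the group size and the 5-millisecond cost per coordination round are all hypothetical.

```python
def coordination_rounds(num_micro_batches, group_size):
    """Number of scheduler round-trips when batches are coordinated in groups."""
    # Ceiling division using integers only.
    return -(-num_micro_batches // group_size)


def total_coordination_ms(num_micro_batches, group_size, cost_per_round_ms):
    """Total time spent on coordination for a run of micro-batches."""
    return coordination_rounds(num_micro_batches, group_size) * cost_per_round_ms


# Hypothetical workload: 100 micro-batches, 5 ms per coordination round.
per_batch = total_coordination_ms(100, group_size=1, cost_per_round_ms=5)
grouped = total_coordination_ms(100, group_size=10, cost_per_round_ms=5)

print("coordinate every batch:     ", per_batch, "ms")
print("coordinate every 10 batches:", grouped, "ms")
```

Under these assumed numbers, grouping cuts coordination overhead tenfold without changing the size of the batches that actually execute. The trade-off, in this reading, is that the scheduler reacts to changes in the cluster only at group boundaries.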
In an interview, Spark inventor Matei Zaharia, who is CTO at Databricks and another co-founder -- as well as Stoica's former grad student -- said parts of Drizzle would likely appear in Apache Spark during the third quarter of 2017.
Pursuing a unified model
Both Stoica and Zaharia emphasized that recent advances in streaming technology for Spark, including a Structured Streaming engine and API added as part of Spark 2.0 last year, have focused on enabling a more cohesive approach for programmers who combine real-time and batch data processing on a single platform. They positioned Spark overall as a unified approach to diverse data management and analytical needs that include ETL, machine learning and SQL querying, as well as streaming.
"We think of Spark as the infrastructure for machine learning, which itself is really a small part of the entire workflow," Stoica said. "You have to clean the data, and transform it. Then, at the end, when it is curated, you apply machine learning algorithms on top."
This unified approach has merit, according to a machine learning user at a marketing analytics firm who attended the Boston event.
"Previous to our use of Spark, we had ETL, machine learning and other analytics processes, and they were all on different software stacks," said Saket Mengle, senior principal data scientist at Boston-based DataXu Inc. "Spark allows us to put this on one stack. It is something you have to tweak, but uniformity is good."
Spark in context
Improvements to Spark Streaming should be viewed in the context of Spark's overall analytical adoption, said one industry analyst on hand at the conference.
"Spark's long-term appeal has been as an ensemble of analytical approaches, and its ability to address a variety of workloads," said Doug Henschen, a principal analyst at Constellation Research Inc.
In a blog post following the conference, Henschen remarked that Spark was progressing more quickly than its predecessor, Hadoop, had at a comparable stage of development, and that it promises "wider hands-on use" by a variety of developers and data scientists.
One measure of Spark's progress is its adoption by vendors beyond Databricks, he said. In fact, the open source version, Apache Spark, is offered by traditional enterprise players like IBM and Oracle, as well as Hadoop distribution providers Cloudera, Hortonworks and MapR.
It's noteworthy, too, that Spark is offered on the cloud by the likes of Amazon, Google, Microsoft and others. So far, Databricks has focused its efforts on providing cloud services, which is where its new approach to streaming will likely first be tested.