
Users jump some hurdles to do streaming, machine learning with Spark


In this edition of the Talking Data podcast, Spark users report that latency and development challenges can make it difficult to get started with machine learning on Spark systems.

Apache Spark may be maturing as a data-processing platform, but it still has a long way to go. Despite its usefulness in big data analytics and the ease of implementing machine learning with Spark, users are still running into problems, some of which were brought up at last month's Spark Summit East 2017 in Boston -- the subject of this edition of the Talking Data podcast.

One issue voiced at the conference is the latency of data streaming in Spark. The microbatching architecture of the Spark Streaming module has been a source of dissatisfaction for some users, because Spark isn't a pure real-time processing engine. While this isn't always a problem, some vendors and users staunchly back Apache Flink, a true record-at-a-time streaming engine that, according to its proponents, offers much lower latency. Flink hasn't reached the level of popularity that Spark enjoys, but it is nevertheless a viable competitor to the more prominent processing technology.
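The latency complaint follows directly from the microbatch model: a record that arrives just after a batch closes waits up to a full batch interval before processing even begins, a floor that record-at-a-time engines such as Flink avoid. The sketch below is a minimal plain-Python simulation of that grouping behavior; it is illustrative only and does not use the Spark API.

```python
def microbatch_stream(events, batch_interval):
    """Group a time-ordered stream of (timestamp, value) events into
    fixed-interval micro-batches -- the scheduling model Spark Streaming
    uses. Illustrative sketch only, not the Spark API."""
    batches = []
    current = []
    window_end = events[0][0] + batch_interval
    for ts, value in events:
        # Close the current batch once its time window has elapsed.
        while ts >= window_end:
            batches.append(current)
            current = []
            window_end += batch_interval
        current.append((ts, value))
    if current:
        batches.append(current)
    return batches

# Records arriving every 100 ms, batched on a 1-second interval:
events = [(ms, ms) for ms in range(0, 2500, 100)]
batches = microbatch_stream(events, batch_interval=1000)
print([len(b) for b in batches])  # [10, 10, 5]
```

No batch is handed to the processing engine until its window closes, so end-to-end latency can never drop below the batch interval, which is the structural limit the Drizzle work discussed at the conference aims to shrink.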

In a keynote, Ion Stoica, executive chairman of Spark vendor and conference organizer Databricks Inc., discussed how the Drizzle framework, which is expected to be added to Spark Streaming sometime this year, will help to reduce streaming latency. Created by RISELab at the University of California, Berkeley, Drizzle is expected to speed up machine learning in Spark and make its streaming performance more efficient overall.

Spark users have also run into difficulties implementing the streaming programs that feed machine learning applications. Downtime in a streaming data pipeline can cause major problems for analytics efforts. Programmers are still learning through experimentation how to set up microbatch intervals and the DataFrames that group distributed data sets together in Spark; for many, using it takes considerable training and experience.
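Part of what makes microbatch intervals tricky to set up is an operational rule of thumb: each batch must finish processing before the next one arrives, or work queues up and latency grows without bound. The helper below is a hedged, simplified illustration of that stability check, not anything from the Spark API.

```python
def streaming_is_stable(batch_interval, processing_time_per_batch):
    """A micro-batch pipeline keeps up only if each batch is processed
    faster than new batches arrive; otherwise unprocessed batches queue
    up and end-to-end latency grows without bound. Simplified rule of
    thumb for illustration -- not a Spark API."""
    return processing_time_per_batch < batch_interval

# A 2-second interval with 1.5 seconds of work per batch keeps up:
print(streaming_is_stable(2.0, 1.5))  # True
# The same workload on a 1-second interval falls behind:
print(streaming_is_stable(1.0, 1.5))  # False
```

Shrinking the interval lowers latency but tightens this constraint, which is one reason tuning streaming jobs takes the training and experience described above.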

Despite these issues, more enterprises are employing Spark for stream processing, as well as starting to utilize machine learning with Spark in their operations. The evolution of the processing engine is making this easier, enabling those businesses to predict trends in a variety of business areas.

Listen to the podcast to hear about Spark, Drizzle, Flink, machine learning and more from Spark Summit East 2017.

Next Steps

Read more from the Spark Summit on how Drizzle will change Spark Streaming

Why Spark is getting a central role in big data environments

Learn how machine learning can make a big difference for businesses