Functionality gaps not stopping Spark usage from growing fast

Organizations aren't letting ongoing Apache Spark development, functionality holes or issues deter them from ramping up usage of the technology. Find out why.

Spark usage is growing rapidly, even though the data processing engine still has some growing of its own to do.

For example, software vendor Xactly Corp. is using Spark to run a mix of batch and real-time applications. While Spark's fast performance makes it a valuable processing tool, the big data technology still has some rough edges that need to be smoothed out, in the eyes of Ron Rasmussen, the company's CTO and senior vice president of engineering.

"It's not immature," Rasmussen said. "But when you compare it to running transaction-level Oracle, it's not there yet." For example, Xactly has had to troubleshoot idiosyncrasies in memory usage, sometimes turning to support technicians at Hadoop vendor MapR Technologies Inc. for help. And monitoring Spark queries is something of a guessing game for Rasmussen's team. "It's hard to know if something is supposed to be taking that long," he said.

Increasingly, Spark is pushing aside MapReduce, Hadoop's original programming environment and execution engine, for batch-processing uses due to its performance advantages. But Spark doesn't fully measure up even to MapReduce on some types of functionality, said Nitin Kak, a lead software development engineer at online marketing analytics platform vendor Quaero. In working with Spark, Kak has had to manually provision the amount of memory and the number of CPU cores required by processing jobs, things that he said MapReduce can take care of automatically.
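The manual provisioning Kak describes can be pictured with a short sketch. This is an illustration under stated assumptions, not Quaero's actual setup: the helper name `spark_submit_args`, the application file name and the resource values are all hypothetical. The property names `spark.executor.memory`, `spark.executor.cores` and `spark.executor.instances` are standard Spark settings, passed here through the `--conf` option of `spark-submit`.

```python
from typing import Dict, List


def spark_submit_args(app: str, executor_memory: str,
                      executor_cores: int, num_executors: int) -> List[str]:
    """Build a spark-submit command line with explicit resource settings.

    Hypothetical helper for illustration; the values passed in are
    assumptions, not figures from the article.
    """
    conf: Dict[str, str] = {
        "spark.executor.memory": executor_memory,       # memory per executor, e.g. "4g"
        "spark.executor.cores": str(executor_cores),    # CPU cores per executor
        "spark.executor.instances": str(num_executors), # number of executors
    }
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # Print the command rather than running it; "job.py" is a placeholder.
    print(" ".join(spark_submit_args("job.py", "4g", 2, 10)))
```

Each of these numbers has to be chosen per job, which is the kind of hand-tuning Kak said was needed in Spark.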

Nonetheless, Quaero, based in Charlotte, N.C., has built a Spark-based identity resolution engine to find matching data elements in clickstream records, online transactions and other web activities in order to pinpoint individual consumers for website personalization and targeted marketing. One of the resolution engine's two batch application modules was developed for Spark from the outset; the other was initially written in MapReduce when Quaero put the program into production in late 2015, but it was redone to run in Spark in a second release launched last spring.

On the plus side, Kak said it was much faster to write Spark code in the Scala programming language than it was to program in MapReduce. Spark users can also persist data in memory "and keep working on it again and again," he added.

Performance-wise, rewriting the second module cut the processing time on a trial run against about 500 million data records stored in a cluster based on Cloudera Inc.'s Hadoop distribution from five hours in MapReduce to 90 minutes with Spark. The identity resolution jobs involve "a lot of repetitive algorithms, data matching iterations and looping," said Dan Smith, vice president of platform development at Quaero. "For a very algorithmic, mathematical, computation-heavy workload, Spark works very well."

Looking past Spark's missing pieces

Xactly, which sells cloud software for managing incentive-based compensation programs, also isn't waiting on the processing engine to mature further before ramping up its Spark usage.

The San Jose, Calif., company put a pair of Spark-based applications into production use in October 2015: a batch one that assigns credit for orders to sales team members who deserve a cut of the commission, and a real-time tool that dynamically pulls together customized views of employee payout data for compensation managers. It added a third application last spring: a batch extract, transform and load (ETL) job for aggregating sales compensation data from clients and feeding it into an Oracle relational database for analysis and reporting.

Rasmussen said Xactly runs its transaction systems on Oracle but is using the combination of Spark and a MapR-based Hadoop cluster to offload some of the required batch processing to a lower-cost platform. The subscription cost for using the big data technologies is "a fraction" of the annual support fee that the company has to pay on the Oracle software, he said.

In addition, the credit assignment application now scales linearly with the number of sales orders being processed; in the Oracle system, performance bogged down as data volume increased. "What was taking hours or didn't complete now takes minutes," Rasmussen said, adding that the credit records are then sent back to the Oracle system for transactional uses.

The payout application, on the other hand, is self-contained in the Hadoop and Spark system. It creates what's known as a payout curve, which plots sales quota and commission amounts for an organization's sales reps -- data that can be highly variable because of the different incentives and commission rate tiers built into compensation programs. Rasmussen said his team uses MapReduce to aggregate benchmark data on payouts in Hadoop's companion HBase database. But Spark calculates payout curves on the fly, based on filters set by compensation managers when they kick off queries.

Given the rapid clip at which Spark is being updated, the functionality holes and technical issues faced by users can vary depending on which release they're running -- and that, in turn, may depend on whether they're getting the technology from a vendor, and, if so, which one. The vendors that support Spark aren't always in lockstep on which release they offer, which can leave some users waiting for technical improvements that other organizations are already taking advantage of.

Putting Spark maturity into perspective

Tony Baer, an Ovum analyst based in New York, said some perspective is warranted on the maturity front because of Spark's relative newness. The processing engine was created in 2009 and open sourced the following year, but it wasn't set up as an Apache Software Foundation project until mid-2013. "You have to keep in mind that when we talk about all the shortcomings, this technology is really less than five years old," Baer said. "The speed at which it's being enhanced is pretty amazing."

Celtra Inc., a Boston-based company that offers a platform for designing online display and video ads, was one of the earliest adopters of Spark. Its Spark usage began at version 0.5 of the base open source software, which was released in 2012 prior to Apache's involvement. Celtra then was among the first beta users of lead developer Databricks Inc.'s cloud-based Spark implementation, putting it into production use in January 2015.

Spark is "a much different technology," both on functionality and stability, than it was in the early days, said Grega Kespret, Celtra's engineering director for analytics.

"We had a lot of problems in the beginning with Spark -- we spent quite a lot of time debugging and tuning it," Kespret said, pointing in particular to out-of-memory errors on processing jobs. "Today, it's pretty stable for us. We don't see a lot of failures -- or if there is a failure, we know how to fix it." The technical documentation that's available on Spark has also been greatly improved, he added.

Celtra continues to use both open source Spark and the Databricks version -- the former to do ETL conversions on data, and the latter as the primary analytics platform for the company's data analysts. The architecture, built in the Amazon Web Services cloud, doesn't include Hadoop. Instead, Kespret said, the Spark ETL jobs funnel more than 2 billion data points captured daily on ad interactions and other trackable events among the Amazon Simple Storage Service, a MySQL operational data store and a cloud data warehouse from Snowflake Computing Inc. that Celtra deployed in early 2016.

Spark becomes a target itself

While it isn't fully mature, Spark has reached the stage where newer open source technologies are being created as potential alternatives to it, much as its own developers sought to replace MapReduce. For example, Apache Flink is a stream processing engine that originated in Germany; Flink's backers claim it can process heavy-duty data streams faster than Spark, which uses a microbatching approach to streaming in which small batches of data are grouped together and processed in rapid succession.
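The microbatching idea can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Spark's implementation: the function name `micro_batches`, the event format and the half-second interval are assumptions made for the example. Events that arrive within the same fixed time window are grouped and handed off together, rather than being processed one at a time as in a pure event-based engine.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def micro_batches(events: Iterable[Event],
                  interval: float) -> Iterator[List[Event]]:
    """Group time-ordered events into consecutive windows of `interval` seconds.

    Hypothetical helper for illustration; empty windows are skipped.
    """
    batch: List[Event] = []
    window_end = None
    for ts, payload in events:
        if window_end is None:
            # The first window starts at the first event's arrival time.
            window_end = ts + interval
        # Close every window that ended before this event arrived.
        while ts >= window_end:
            if batch:
                yield batch
                batch = []
            window_end += interval
        batch.append((ts, payload))
    if batch:
        yield batch


if __name__ == "__main__":
    stream = [(0.0, "a"), (0.2, "b"), (0.6, "c"), (1.7, "d")]
    for group in micro_batches(stream, interval=0.5):
        print(group)
```

In this sketch an event can wait up to one full interval before its batch is handed off, which is where the latency floor of a microbatching design comes from.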

Independent consultant Thomas Dinsmore said that unlike Spark, Flink was built from the ground up to support pure event-based streaming. "But microbatching ain't bad, and the reality is that there are very few analytics applications that require less than the half-a-second latency you can get with Spark," Dinsmore said. He said that, overall, he thinks Spark "has a lot of legs" as a big data processing and analytics platform for a mix of stream processing, machine learning, ETL and other uses going forward.

Novantas Inc. certainly has big plans for Spark usage. The New York company, which provides analytics services and tools to financial institutions, is using a Cloudera-based Hadoop and Spark system to run an application called MetricScape that it initially built for one bank in early 2016. Kaushik Deka, CTO and director of engineering for its Novantas Solutions technology unit, said the application acts as a librarian of sorts for customer and financial data metrics, providing a governance layer that tracks things such as data lineage, definitions and dependencies.

The idea, Deka said, is to help data scientists pull together relevant data sets for analysis. In the case of the initial user, that involved looking at customer account histories, the results of previous marketing campaigns and other data to segment millions of bank customers based on their likely responsiveness to planned promotional offers. Spark does the batch ETL processing that creates an underlying data model and partitioned data sets for MetricScape users -- Deka said Novantas went with it over MapReduce because of Spark's faster performance and support for the Scala and Python languages.

Novantas is also working conceptually on a second application that would put Hadoop and Spark at the heart of an automated rules engine aimed at providing bank employees with analytics information in real time. For example, bank managers dealing with customers looking to negotiate reduced mortgage rates could get on-the-fly rate recommendations based on the customers' overall relationship with the bank, Deka said.

Spark needs some further development to be able to handle that kind of processing across thousands of bank branches, he added. He's confident, though, that the processing engine will get there. "I'm not sure it's ready for that use case now -- but I think it will be."
