Apache Spark architecture speeds data jobs, ousts MapReduce

Its collection of big-data processing features is priming the Apache Spark architecture for wider deployment. One key trait: Spark performance outpaces MapReduce in many Hadoop use cases.

Offering fast in-memory processing, high-level machine learning libraries and integrated data streaming capabilities, the open source Apache Spark architecture continues to find more adherents in both Web upstarts and traditional enterprise settings.

At the Databricks-organized Spark Summit East 2016 in New York, users shared their reasons for employing the Spark architecture, which melds several useful APIs with a basic in-memory analytics engine. Their experiences and those of others add weight to recent Market Research Media estimates that, globally, the Spark market could reach $4.2 billion by 2020.

Increasingly, Spark is at the heart of efforts to process data in motion, and fraud detection is one of the prime examples.

Chris D'Agostino, vice president of technology at Capital One, based in McLean, Va., told a summit crowd his team is using Spark to harden its defenses against financial fraud rings, even as the bank's digital applications create more and more data. The goal for Spark is to cut the gap between the time a series of frauds begins and the time the activity is identified and halted.

It starts with streaming

D'Agostino said Capital One has used Spark Streaming to combine large data sets for historical information, using both Spark's SQL interfaces and its graph data format.

"This is where Spark has been useful for us," D'Agostino said. "We can combine information in SQL and graph formats, and create execution models to make scoring decisions." The data is being fed to Spark's machine learning tools to help identify likely cases of false identity and bogus account sign-ups, according to D'Agostino.

He said Capital One uses a Databricks-supported connector to link Amazon Redshift data to Spark, allowing applications on the Amazon Web Services cloud to handle more data quickly, and to look at more varied features in the data -- ones that might uncover fraudsters.

D'Agostino said his team's effort with open source Apache Spark architecture components is part of a larger Capital One effort to model IT along Agile lines. Teams are organized for projects based on a "stack," which usually includes an enterprise architect, a data scientist and other analysts, user interface developers, and data engineers who handle the middle-tier and data infrastructure.

MapReduce diminished

Shops with long-standing Hadoop experience continue to bring in Spark -- in many cases, shifting work from Hadoop 1.0's original MapReduce processing engine to the newer engine.

"Spark has been making many steady inroads" as a component in a data platform supporting analytics for e-commerce, said Seshu Adunuthula, head of analytics infrastructure at eBay Inc., based in San Jose, Calif.

Like other e-commerce sites, eBay is seeing a large shift to the use of mobile devices, even as it adds 8.8 million listings every week. At the same time, eBay sees a need for greater personalization of the website experience, which requires improved in-house analytics capabilities.

At the Spark conference, Adunuthula described a multiyear effort eBay has made to open up flexible analytics within the company based on increased use of Hadoop, now accompanied by increased use of Spark.

Spark processing outpaces MapReduce in some important use cases, Adunuthula indicated. He said eBay is moving "classic MapReduce jobs" that build multidimensional analytical cubes over to Spark.

At your real-time bidding

Use of Spark at Boston-based DataXu Inc. comes on the heels of considerable use of Hadoop and its original MapReduce data processing engine for machine learning, said Beth Logan, senior director of optimization. DataXu is a Web company that enables markets for real-time bidding on online ads.

"Spark is faster," Logan said, echoing conclusions of others who have pitted MapReduce against Spark. Moreover, she said, Spark's ML Pipeline interface serves to automate iterative processing for machine learning in ways that MapReduce could not.

That is important, because DataXu data streams are daunting. Logan said the company's systems process as much as 1.6 million requests per second, while matching advertisers with available spot ads on the Web. Spark speed is one benefit -- there are others.

A big attraction of Spark for DataXu is its machine learning libraries, Logan said, noting DataXu's original MapReduce system relied on homegrown code to implement machine learning. Machine learning code for distributed deployment is difficult to write and to debug, and there is some comfort in being part of a larger community software effort, she indicated.

Preference for open source

Because Spark machine learning libraries are open source, she said, "We don't have to find every bug ourselves. That also means increased reliability. We have less of our own code to maintain."

Components such as its in-memory processing engine and high-level machine learning libraries have helped the Apache Spark architecture gain a unique place in current big data efforts -- but its open source traits are shared with other software from the Apache Hadoop ecosystem. That remains a chief factor in many big data analytics buying decisions.

"People who are making these decisions have a preference for open source," said Thomas Dinsmore, an independent consultant based in Newton, Mass. "In general, those people prefer open source products over commercial products."

Dinsmore advised that this doesn't mean such users won't buy commercial products. It does mean, however, that there is a marked preference for open source where possible. That theme ran through many Spark discussions at the summit.
