After spending several years in the shadow of Hadoop, the Apache Spark data processing engine is stepping out into the light as a tool for use in big data architectures. And last week, the open source software reached a landmark when the Apache Software Foundation released Version 1.0 of the technology, which followed Spark's February ascent to top-level project status at Apache after eight months at the incubator level.
"One-oh means the Spark project, within a short amount of time, has accomplished some major milestones," said Chris Mattmann, a chief architect at NASA's Jet Propulsion Laboratory in Pasadena, California, and a member of the Apache board of directors. "Spark has a great trajectory."
Spark is aimed at providing faster cluster-based processing than MapReduce, the parallel processing engine and software programming framework that drove the initial development of Hadoop applications. The newer software is designed to do more work in-memory, enabling it to better support interactive queries, as well as iterative and real-time processing, according to Mattmann.
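To see why keeping data in memory matters for iterative work, consider a rough sketch in plain Python. It is not Spark or Hadoop code: `SlowStore` is a hypothetical stand-in for disk-backed storage, and the two loop functions are invented purely for illustration. The only point is to count how often each style of loop goes back to storage.

```python
# Conceptual sketch only: plain Python, not Spark or Hadoop code.
# SlowStore is a hypothetical stand-in for disk-backed storage such as HDFS.


class SlowStore:
    """Pretend storage layer that counts how often the full data set is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def iterate_with_reload(store, steps):
    """Batch-style loop: every pass goes back to storage for its input."""
    estimate = 0.0
    for _ in range(steps):
        data = store.load()
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)  # move halfway toward the mean
    return estimate


def iterate_in_memory(store, steps):
    """In-memory loop: load once, then reuse the cached copy on every pass."""
    data = store.load()
    estimate = 0.0
    for _ in range(steps):
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)
    return estimate
```

Run for 10 steps, the first function reads the store 10 times and the second reads it once, while both return the same estimate. In a real cluster each of those reads would be a pass over disk or the network, which is the cost an in-memory design tries to avoid.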
Spark's developers have worked to ensure it is compatible with the Hadoop Distributed File System (HDFS) and other Hadoop data repositories, such as NoSQL databases HBase and Cassandra. The technology can run in Hadoop systems on top of YARN, the cluster resource manager introduced in Hadoop 2 to add support for non-MapReduce applications.
But Mattmann noted it can run separately from Hadoop, as well. Furthering that capability, Spark 1.0 adds a Spark SQL component to support schema-based data modeling with the SQL query language, a move that seeks to meet a growing need to query the structured and unstructured data produced by massively parallel applications.
Spark wants to take you higher
A big part of the Spark push, which originated at the University of California at Berkeley, has revolved around support for higher-level programming through APIs in Scala, Java and Python, meant to shield developers from the complexity of MapReduce-oriented parallel programming.
For his part, Mattmann credits Spark for "doing well as a low-latency environment." He said he and teams of data scientists at NASA are working with satellite data to monitor and analyze snowpacks in the western U.S. and to do climate modeling and assessment. The analytics results are intended to be used as part of water resource planning activities by the National Integrated Drought Information System program and other government entities. Mattmann said fast turnaround times are needed to aid drought preparation and response efforts, and he thinks Spark can help.
"We have to deliver tens of terabytes of data in 24 hours, and Spark is a real advancement there," he said. "Vanilla Hadoop and MapReduce were very I/O-oriented. They allowed you to scale, but they were not amenable to real-time activity."
Curt Monash, president of analyst company Monash Research, also faulted the real-time performance of MapReduce, which is geared to batch processing. He said there's promise in Spark as a "next-generation parallelization paradigm." Its ability to successfully work iteratively on problems also makes it a good candidate for machine-learning tasks, according to Monash.
Seeking a Spark for stream-processing apps
Streaming and event processing have also been cited as Spark use cases. But Storm-on-YARN and other parallel processing approaches will vie there, too. While crediting Spark for some uses, one long-time Hadoop hand who now heads a streaming technology startup suggested that Spark may lag in this application type.
"As a Cal alumni, I think Spark is a good thing," said Phu Hoang, co-founder and CEO of DataTorrent Inc. "But a lot of what is going on with Spark is people trying to speed up MapReduce. The main appeal for Spark has been to do MapReduce in memory." He described Spark's processing tack as a "mini batch approach" and contended that the technology's latency, while an improvement on MapReduce's, may not be sufficient for future big data streaming and event processing jobs.
At DataTorrent, Hoang and his colleagues are pursuing streaming strategies based on a combination of home-brewed Java operators, YARN and HDFS. To that end, the company this week rolled out its DataTorrent Real-Time Streaming software for Hadoop 2 systems at the 2014 Hadoop Summit in San Jose, California.
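The "mini batch approach" Hoang refers to can be pictured with a small sketch in plain Python. This is an illustration of the general idea, not Spark's streaming API or DataTorrent's software; the function names, the integer millisecond timestamps and the one-second interval are all assumptions made for the example.

```python
# Conceptual sketch of micro-batching: plain Python, not a real streaming API.
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into fixed-width time windows.

    Returns a list of (window_start, values) pairs in time order.
    """
    buckets = defaultdict(list)
    for ts, value in events:
        window_start = (ts // interval) * interval
        buckets[window_start].append(value)
    return [(start, buckets[start]) for start in sorted(buckets)]


def process_stream(events, interval):
    """Run one small batch job (here, a sum) per window."""
    return [(start, sum(values)) for start, values in micro_batches(events, interval)]


def worst_case_delay(events, interval):
    """Longest wait between an event's arrival and the close of its window."""
    return max((ts // interval) * interval + interval - ts for ts, _ in events)


# Timestamps in milliseconds, batch interval of one second (both hypothetical).
events = [(200, 1), (900, 2), (1100, 3), (2500, 4), (2600, 5)]
print(process_stream(events, 1000))
print(worst_case_delay(events, 1000))
```

In this sketch no result can be emitted until an event's window closes, so an event that arrives early in a window waits for nearly the full interval. That built-in delay is the kind of latency floor Hoang is pointing to when he contrasts mini batches with processing each event as it arrives.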
But while Apache Spark is still early in its lifecycle, the software has gained a list of adherents that may ensure it gets wide consideration. For example, it has been used at IBM, Intel, Yahoo and China-based e-commerce company Alibaba. Hadoop distribution providers Cloudera and MapR Technologies have also both shown support via alliances with Databricks, a startup spearheaded by a team including Chief Technology Officer Matei Zaharia, one of the originators of the Spark effort at U.C. Berkeley in 2009. Databricks has forged a similar alliance with DataStax, which offers a commercial version of Cassandra.
The world of big data architectures gets no easier to sort through as the flurry of new software continues. With its 1.0 release now available, Spark seems poised to gain added attention in the weeks and months to come, during which its pros and cons will be tested.