The marquee at New York's Javits Center proclaimed "Strata + Hadoop World," but the conference organizers might well have added "Spark" to the top of the bill. Apache Spark, the open source big data processing engine, was front and center at last month's event. Sessions on the Spark framework were well attended, and it was high on the agenda of numerous technology vendors, appearing in the latest releases from several Hadoop distribution providers and as part of analytical product rollouts from Alpine Data Labs, Cray Inc., Dell and others.
The Spark software was initially designed to be an alternative to MapReduce, the central processing component of first-generation Hadoop. Playing catch-up, Spark quickly moved from the labs of the University of California, Berkeley to a top-level Apache project, and then to a Version 1.0 release earlier this year. Proponents claim that Spark can run Hadoop batch processing applications up to 100 times faster than MapReduce can.
But Spark can also run a variety of other applications beyond the batch-oriented ones that MapReduce supports. For example, it can be used in iterative, machine-learning applications, especially ones involving continuously updated streams of event data. Thus far, such uses have included music recommendation engines and genomics research. Spark incorporates an extensive library of machine-learning algorithms, as well as APIs supporting SQL queries, graph processing and general purpose data streaming.
That versatility has helped the Spark platform quickly gain wider attention, at least from vendors looking to take advantage of its big data processing capabilities. "The interest in Spark has to do with a different set of use cases than Hadoop has been used for, at least up until now," said Gartner Inc. analyst Merv Adrian.
Spark is also tailored for in-memory processing, which is meant to give it a big performance boost over disk-bound MapReduce. "We've seen Spark interest building for a while, just in terms of adding a new in-memory capability for high-performance in-memory processing," said Matthew Aslett, an analyst at 451 Research. "The interesting thing beyond that is that it enables multiple approaches to analytics in a single in-memory engine."
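To make the in-memory argument concrete, here is a toy sketch in plain Python. It does not use Spark or MapReduce, and every name in it (`write_dataset`, `run_disk_bound`, `run_cached`) is hypothetical. It only illustrates why an iterative job that keeps its working set in memory touches the disk far less often than one that reloads its input on every pass, which is the pattern attributed to disk-bound processing above.

```python
import os
import tempfile


def write_dataset(path, values):
    """Write one numeric value per line (a stand-in for a stored data set)."""
    with open(path, "w") as f:
        for v in values:
            f.write(f"{v}\n")


def load(path, counter):
    """Read the whole data set from disk and count the read."""
    counter["disk_reads"] += 1
    with open(path) as f:
        return [float(line) for line in f]


def gradient_step(data, estimate, rate=0.5):
    """One gradient-descent step that moves the estimate toward the mean."""
    grad = sum(estimate - x for x in data) / len(data)
    return estimate - rate * grad


def run_disk_bound(path, iterations):
    """Reload the input on every iteration (assumed disk-bound pattern)."""
    counter = {"disk_reads": 0}
    estimate = 0.0
    for _ in range(iterations):
        data = load(path, counter)
        estimate = gradient_step(data, estimate)
    return estimate, counter["disk_reads"]


def run_cached(path, iterations):
    """Load the input once, then iterate over the in-memory copy."""
    counter = {"disk_reads": 0}
    data = load(path, counter)
    estimate = 0.0
    for _ in range(iterations):
        estimate = gradient_step(data, estimate)
    return estimate, counter["disk_reads"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "events.txt")
        write_dataset(path, [1.0, 2.0, 3.0, 4.0])
        print(run_disk_bound(path, 10))
        print(run_cached(path, 10))
```

Both functions arrive at the same estimate; the only difference is how many times the input is read. The sketch also hints at the trade-off analysts raise about in-memory engines: the cached version has to hold the full data set in memory for the life of the job.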
Spark product iterations take new paths
At Strata + Hadoop World, various vendors made sure they had a Spark story to tell when meeting with prospective users. That builds on a recent surge in support for Spark from companies that offer Hadoop distributions, including Cloudera Inc., Hortonworks, MapR Technologies Inc. and Pivotal Software Inc. Hardware and software houses also drove the Spark bandwagon forward.
For example, Cray released its Urika-XA analytics system, supporting up to 48 compute nodes and pre-integrated with Apache Hadoop and Spark. Nano-scale material structure analysis is one early customer's area of interest, Cray said.
Meanwhile, Dell included the Spark framework in its In-Memory Appliance for Cloudera Enterprise, again supporting up to 48 compute nodes. The company said it has found early interest at a large retailer doing in-store RFID-based product tracking. And Alpine Data Labs introduced Alpine Chorus 5.0, an advanced analytics platform that includes an extensible framework to allow business users to build and manage Spark workflows.
"Spark provides a timely streaming type of analytics," said Eric Carr, vice president of the core systems group at Guavus Inc., a software vendor that has built an operational analytics platform for communications and marketing companies. Its Reflex 2.0 technology recently gained a Certified Spark Distribution designation from Databricks, another vendor that was founded by Apache Spark's originators. "Spark machine learning is all about processing in-memory iteratively," Carr said. "You could do that with Hadoop, but it's more challenging."
But Guavus uses Spark in conjunction with the Hadoop Distributed File System (HDFS), and Carr chalks up some of the present interest in Spark to advances in the overall Hadoop architecture. YARN, the resource manager at the heart of Hadoop 2, "is a key enabler to make Spark possible," he said. With YARN, users can plug in Spark, Storm or other Hadoop-compatible technologies depending on the problem they're trying to solve, while still using HDFS as the underlying file system.
It's early on the Spark curve
A current drawback Carr cites is that Spark "still has some ways to go" to use SQL as a query language for probing data. The same is often said of Hadoop-related tools like Impala, Storm, Stinger and Tez. He also points to another weak point -- one that, again, is often mentioned in discussions of Hadoop itself: The software is still at an early stage of development.
While generally agreeing that Spark is a fast-moving phenomenon, Aslett and Adrian see maturation issues as well.
Aslett said problems could result if organizations start using Spark for the wrong reasons, without giving enough consideration to which applications suit it before moving forward. Like Carr, he also sees a possible double edge to the in-memory blade. "If you use a lot in-memory, it's expensive," Aslett said. "People have to think carefully about the business cases and which technologies match up to them."
"Spark is going to need to grow up a bit -- it's brand new," Adrian said. He added, though, that things move quickly in the world of big data management and analytics: "In Hadoop years, it's already an adolescent."