NoSQL revs up to the tune of the Spark connector

Much of the attention has gone to Spark running on Hadoop, but Spark connectors for NoSQL databases are ushering in a new class of operational analytics.

Online transaction processing technology has been in flux in recent years, as NoSQL databases arose to deal with massive data scaling. Meanwhile, classic analytical schemes were disrupted as Hadoop and Spark emerged.

Now, applications are coming forward that exploit both of these wunderkinds to create near real-time analytics atop large-scale transactional systems. In particular, Spark connectors to NoSQL systems are becoming increasingly prevalent.

"There is a changing landscape. We are changing what we are doing with operational systems and analytical systems," analyst Mike Ferguson said last week during a webinar staged by Dataversity.

Ferguson, who is managing director of U.K.-based Intelligent Business Strategies Ltd., discussed linking Spark with the key-value Riak NoSQL data store from Basho Technologies as a means to speed up analytics on Web and mobile data.

Like competitors Aerospike, Couchbase, DataStax, Redis Labs and others, Basho is fielding a Spark connector for this purpose.

Operational analytics

Ferguson said combining NoSQL with Hadoop and Spark lays the groundwork for something he called "operational analytics" -- that is, systems that rely less on overnight batch jobs than past analytical systems did.

For many years, relational databases were on both ends of the loop between operational and analytical systems, Ferguson said. But that has started to change as Web and mobile applications require the type of scalability that is achieved by running distributed computer clusters. For data processing, the Web was one thing -- mobile was another.

With mobile access to transactional systems, "the number of concurrent users rocketed to completely unprecedented levels," he said. That has led teams to try out new architectures, and Spark has emerged as a contender in that race.

Like others, Ferguson pointed out that Spark, while it often runs on Hadoop, is not limited to Hadoop storage. "It can access relational data stores and NoSQL data stores," he said.

That, in turn, supports the use of Spark across a range of analytics, "some of which are operational analytics," he noted.

As Ferguson described it, operational analytics tries to leverage analytics to prevent events or to optimize processes. That can lead to applications that mitigate risk, improve customer interactions or reduce unplanned operational costs, Ferguson said.

A/B test cases

An example of a NoSQL database working with the Spark analytical engine comes by way of Intuit, the provider of Web-based financial and tax preparation services.

According to Rekha Joshi, a software engineer at Intuit Inc., based in Mountain View, Calif., Spark has been implemented to perform analytics on data stored in the DataStax Cassandra database running on the Amazon Web Services cloud.

One use case she described focused on A/B testing of visitors' preferred interactions with the Intuit website. The goal is to get closer to the ability to both understand visitors' preferences and act automatically to personalize their Web views and site interactions.

Millions of site users create a lot of data, she remarked, and the Cassandra NoSQL database has the capability to keep up with the influx. But, she continued, "Cassandra is not meant for number crunching. You have either Spark or Hadoop for that."
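To give a sense of the kind of number crunching that gets handed off to an analytics engine in an A/B test, here is a minimal, single-machine sketch in plain Python that compares the conversion rates of two page variants with a two-proportion z-test. The visitor counts, the function name and the choice of statistical test are illustrative assumptions, not details from Intuit; in a setup like the one Joshi described, the aggregation would presumably run in Spark over data read from Cassandra.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: conversions and visitors for variants A and B.
z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value would suggest the difference between the two variants is unlikely to be chance alone, which is the sort of signal a personalization system could then act on.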

In the background, the Intuit crew works with both Spark and Hadoop. "They both have their performance [strengths] and their limitations," she said. The work requiring closer-to-real-time performance goes to Spark.

"Hadoop is a batch processing system. Spark is there for real time, or near real time," she said.

The path Joshi described bears some kinship to Ferguson's operational analytics, but she chose another term used these days to describe architectures that support both batch and real-time analytics: lambda architecture.
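As a rough illustration of the lambda idea, the toy sketch below keeps page-view counts in plain Python: a batch view is recomputed from the full event log, a speed view is updated as each event arrives, and queries merge the two. The class and method names are hypothetical, and the sketch is single-threaded and in-memory; it is not Intuit's design, where Hadoop and Spark would fill the batch and speed roles.

```python
from collections import Counter


class LambdaCounter:
    """Toy lambda architecture: counts served from a batch view plus a speed view."""

    def __init__(self):
        self.master_log = []         # append-only record of every event
        self.batch_view = Counter()  # recomputed periodically from the full log
        self.speed_view = Counter()  # updated per event since the last batch run

    def ingest(self, page):
        # Every event goes to both the master log and the speed layer.
        self.master_log.append(page)
        self.speed_view[page] += 1

    def run_batch(self):
        # Batch layer: recompute the view from scratch over the whole log,
        # then discard speed-layer counts the new batch view now covers.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, page):
        # Serving layer: merge the batch view with the recent speed view.
        return self.batch_view[page] + self.speed_view[page]


views = LambdaCounter()
for page in ["home", "pricing", "home"]:
    views.ingest(page)
views.run_batch()      # batch view now covers the first three events
views.ingest("home")   # arrives after the batch run, so only the speed view sees it
print(views.query("home"))  # prints 3
```

The point of the split is that the batch layer can be slow and thorough while the speed layer covers only the gap since the last batch run, so queries stay current without waiting for the next full recomputation.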

Analyses like Ferguson's and use cases like Joshi's give a view into changes in architectures for data analytics. There could well be more in store.

For NoSQL, there is considerable growth to come, according to Allied Market Research, which estimated that a global NoSQL market that barely existed 10 years ago will reach $4.2 billion by 2020.

Like Ferguson, Allied cited Web, mobile and e-commerce applications as NoSQL growth drivers. Connections to new analytics engines like Spark could contribute to NoSQL growth as well, broadening the available uses for NoSQL. 

Ed Burns, site editor for SearchBusinessAnalytics, also contributed to this story.
