MapR adds 'Streams' messaging to its Hadoop data pipeline

MapR's Hadoop distribution will add a message system to feed a streaming data pipeline. It takes a cue from open-source Kafka technology.

As big data volumes grow, new data pipeline tools are arriving to handle the surge, and MapR Technologies Inc.'s MapR Streams is among the latest. The company said the software can forward billions of events per second.

Due for release in the first quarter of 2016, this publish-and-subscribe messaging broker funnels data into streaming processors. It rides a wave started by the increasingly popular open-source Apache Kafka message broker, with which it shares some APIs.

MapR Streams differs from Kafka, however, in that it will include MapR's own high-availability and data recovery features. In addition, the company said, MapR Streams can be tightly integrated into data pipelines running on its distribution of the Hadoop data framework, which is now known as the MapR Converged Data Platform.

Apache Hadoop ecosystem software for processing streaming data includes Apache Storm, Apache Flink, Apache Spark and Apache NiFi. These tools can be fed by messaging queues such as Kafka and MapR Streams.
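To make the publish-and-subscribe idea concrete, here is a minimal in-memory sketch. It is not the Kafka or MapR Streams API; the `Broker` class, the topic name and the consumer names are all hypothetical. It only illustrates the general pattern of producers appending to a named topic while each downstream processor reads at its own pace.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory publish-and-subscribe broker (illustration only,
    not the Kafka or MapR Streams API)."""

    def __init__(self):
        # topic name -> append-only list of messages
        self._logs = defaultdict(list)
        # (topic, consumer) -> index of the next message that consumer will read
        self._offsets = defaultdict(int)

    def publish(self, topic, message):
        """Append a message to the end of a topic's log."""
        self._logs[topic].append(message)

    def poll(self, topic, consumer, max_messages=10):
        """Return the next unread messages for this consumer and advance its offset."""
        start = self._offsets[(topic, consumer)]
        batch = self._logs[topic][start:start + max_messages]
        self._offsets[(topic, consumer)] = start + len(batch)
        return batch


broker = Broker()
for reading in ("bp=120", "bp=135", "bp=128"):
    broker.publish("vitals", reading)

# Two independent consumers read the same topic without affecting each other.
first_consumer_batch = broker.poll("vitals", consumer="processor-a")
second_consumer_first = broker.poll("vitals", consumer="processor-b", max_messages=2)
second_consumer_rest = broker.poll("vitals", consumer="processor-b")
```

Because each consumer keeps its own offset, several stream processors can be fed from one topic, which is the role the article describes for the messaging layer.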

MapR said MapR Streams works with Spark, Storm and Flink. But although it shares programmer APIs with Kafka, it is not open-source software. Other vendors have also developed Kafka-derived products. For example, Kafka's inventors, who came up with the system while working at social network LinkedIn, subsequently formed Confluent, which released a 2.0 version of its commercial offering based on the Kafka core just last month.

Data variety and velocity

A MapR platform user at Chicago-based Valence Health sees value in effective data pipeline tooling, and not just for fast data handling. The company began using MapR's Hadoop distribution about a year ago to deal with numerous inbound data feeds related to patient health records, immunizations, pharmacy benefits and other data types.

The challenge was as much about data variety as it was about velocity, according to Dan Blake, Valence Health CTO. "We wanted to wire all the [data elements] together so things naturally flow through it all," he said.

Blake said his teams have been working on improved data ingestion processes, have been interested in Kafka, and "will be evaluating MapR Streams." The technology veteran of more than 30 years looks favorably on MapR's use of Kafka APIs and its plans for integration with the Converged Data Platform. The approach calls for handling messaging and stream processing of multiple workloads on the same Hadoop clusters.

Stream on any cluster

MapR's approach with the new software somewhat mirrors the one it took with the Hadoop Distributed File System, where it customized the core open-source software for high-availability enterprise operations. The company points out that MapR Streams installs directly on Hadoop clusters.

Analyst Robin Bloor emphasized that MapR Streams brings a new type of infrastructure to the big data scene, because it allows streaming on any cluster. Kafka is often used separately for such data handling today, requiring separate configuration, he said.

"You can do the same thing in Kafka, but this [MapR Streams] configuration is neater," he continued. "They allow you to use streaming on any [MapR Hadoop] cluster."

MapR has "bolted this in right at the foundation of the file system," said Bloor, who is chief analyst at The Bloor Group. He said the enhancement better enables MapR users to implement lambda architectures. Those are schemes that use the same data pipeline to support both batch analytics and real-time operations.
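The lambda idea can be sketched in a few lines. This is an illustration under stated assumptions, not MapR's implementation: the event records, the clinic names and the `batch_view`, `SpeedLayer` and `merged_view` helpers are hypothetical. One event log feeds both a periodic batch computation and an incremental real-time one, and queries combine the two.

```python
from collections import Counter

# Hypothetical event log; in practice this would arrive through the messaging layer.
event_log = [
    {"clinic": "north", "visits": 3},
    {"clinic": "south", "visits": 2},
    {"clinic": "north", "visits": 1},
    {"clinic": "south", "visits": 4},
    {"clinic": "east", "visits": 5},
]


def batch_view(events):
    """Batch layer: recompute totals from scratch over a slice of the log."""
    totals = Counter()
    for event in events:
        totals[event["clinic"]] += event["visits"]
    return totals


class SpeedLayer:
    """Real-time layer: update totals incrementally as each new event arrives."""

    def __init__(self):
        self.totals = Counter()

    def on_event(self, event):
        self.totals[event["clinic"]] += event["visits"]


def merged_view(batch_totals, speed_totals):
    """Serving step: combine the last batch result with the newer real-time totals."""
    return batch_totals + speed_totals


# Assume the last batch run covered the first three events.
batch_cutoff = 3
batch_totals = batch_view(event_log[:batch_cutoff])

# Events after the cutoff are handled one at a time by the speed layer.
speed = SpeedLayer()
for event in event_log[batch_cutoff:]:
    speed.on_event(event)

current_totals = merged_view(batch_totals, speed.totals)
```

The merged totals match what a full batch recomputation over the whole log would give, while remaining current between batch runs.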

Big data onslaught continues

Data processing applications have transformed in recent years to handle the swell of data driven in large part by mobile, Web and cloud computing. That's expected to continue, with worldwide data center traffic growing at a compound annual growth rate (CAGR) of 25% through 2019, as estimated by Cisco's Global Cloud Index.
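As a rough illustration of what a 25% compound annual rate implies, the short calculation below compounds it over five years. The five-year span is an assumption made for the example; the article gives only the end year.

```python
def growth_multiple(cagr, years):
    """Return how many times larger a quantity becomes after compounding."""
    return (1 + cagr) ** years


# Assumed span of five years; the 25% rate comes from the estimate cited above.
multiple = growth_multiple(0.25, 5)
print(f"Traffic multiple after 5 years at 25% CAGR: {multiple:.2f}x")
```

Under that assumption, traffic roughly triples over the period.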

That big data onslaught has already led to a slew of new data processing and streaming frameworks. As products like MapR's appear, it seems the messaging systems that feed the stream processors may be in for changes too.

Still, some caution is in order. Veteran technologist Dan Blake has seen the appearance of many new technologies. "There's a new silver bullet all the time," he said.

But, he added, there is generally some bit of value in many of these new "silver bullet" technologies. "At least they provide a vocabulary that is useful for discussion," he said.