While the volume of data today often gets a lot of fanfare, the velocity of that data -- how quickly it's generated, captured and handled -- continues to gain attention, as seen at Strata + Hadoop World 2016 in San Jose, Calif., where streaming data tools and technologies were prominent.
The event showed that Hadoop-based big data implementations increasingly include the Apache Kafka messaging system and the Spark data processing engine's Spark Streaming module. Both technologies have become more prevalent as developers build data pipelines that move beyond Hadoop's original batch-oriented design toward real-time streaming analytics capabilities.
Kafka and Spark Streaming are often used together, with the former acting as a publish-and-subscribe messaging queue that feeds the latter. Once in Spark, the streamed data is then processed in parallel, sometimes for use in automated analytics applications. And in the bountiful Hadoop ecosystem, there also are other emerging contenders, including open source technologies such as Storm, Samza and Flink.
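The pattern described above can be sketched with nothing but the Python standard library -- a minimal, illustrative stand-in, not the actual Kafka or Spark APIs: a publish-and-subscribe queue decouples the producers from a consumer stage that processes records in parallel.

```python
# Illustrative sketch only: queue.Queue stands in for a Kafka topic,
# and the thread pool plays the parallel-processing role that Spark
# fills in a real pipeline. Record names here are hypothetical.
import queue
from concurrent.futures import ThreadPoolExecutor

events = queue.Queue()  # stand-in for a pub-sub message queue

# Producer side: publish clickstream-style records to the queue.
for user_id in ["u1", "u2", "u3", "u4"]:
    events.put({"user": user_id, "action": "click"})

# Consumer side: drain the queue into a batch, then process the
# records in parallel, as a stream processing engine would.
def process(record):
    return f"{record['user']}:{record['action']}"

batch = []
while not events.empty():
    batch.append(events.get())

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, batch))

print(results)  # one processed result per published record
```

The point of the queue in the middle is the same one Kafka serves in production: producers and consumers never call each other directly, so either side can scale or fail independently.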
Streaming's impact on the overall big data space could be notable. Market researcher Wikibon, for example, recently predicted that by 2022, the global market for unified streaming analytics technology will account for 16% of all big data spending, or about $11.5 billion.
Feeding the data pipeline
The streams Spark can handle may not be as swift as those in, say, stock trading applications, which have long been a hotbed of submillisecond data streaming. But for Internet clickstreams and other common data flows, Spark Streaming's microbatching architecture may be enough to meet stream processing needs, according to some Strata + Hadoop World attendees.
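Micro-batching, the approach attendees were weighing against true event-at-a-time streaming, can be illustrated with a small stdlib-only sketch: rather than handling each event as it arrives, incoming events are grouped into fixed-interval batches and each batch is processed as a unit. The interval and event shapes below are hypothetical.

```python
# Illustrative sketch of the micro-batch model (not Spark's API):
# events carry a timestamp, and each one lands in the batch window
# that covers its arrival time.
def micro_batches(events, interval):
    """Group (timestamp, value) events into batches of `interval` seconds."""
    batches = {}
    for ts, value in events:
        window = int(ts // interval)  # index of the batch window
        batches.setdefault(window, []).append(value)
    return [batches[w] for w in sorted(batches)]

# Clickstream-style events: (seconds since start, page clicked).
events = [(0.2, "home"), (0.9, "search"), (1.1, "cart"), (2.5, "checkout")]

# With a 1-second batch interval, the four events form three batches.
print(micro_batches(events, 1))
```

Each batch can then be handed to ordinary parallel batch logic, which is why the model is often "enough" for clickstream-scale latencies while falling short of the submillisecond needs of trading systems.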
"For 90% of what people want to do, Spark Streaming will fit the purpose," said Mohammad Quraishi, a senior IT principal for big data analytics at medical insurer Cigna Corp., in Bloomfield, Conn.
Cigna uses Spark Streaming along with Kafka as part of a larger Hadoop system, based on Cloudera's distribution of the big data framework. Quraishi, who spoke at the conference, said Kafka enables the insurer to create a "speed layer" for data. "It completes the Lambda architecture for us," he said, referring to an often-discussed architectural approach for managing big data that supports both batch and real-time processing.
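The Lambda architecture Quraishi refers to can be reduced to a compact sketch: a batch layer holds precomputed views over historical data, a speed layer (the role Kafka and Spark Streaming fill here) covers only the most recent records, and a query merges the two. All names and figures below are hypothetical, not Cigna's actual system.

```python
# Illustrative Lambda-architecture sketch: batch view plus speed
# view, merged at query time. Data and metric names are made up.
batch_view = {"claims_processed": 1_000_000}  # periodic batch recompute
speed_view = {"claims_processed": 42}         # recent events only

def query(metric):
    """Serve a metric by combining the batch and speed views."""
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(query("claims_processed"))  # 1000042
```

When the batch layer's next recompute absorbs the recent events, the speed layer's view is reset -- which is why the speed layer can stay small and low-latency.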
"A data pipeline is really important. Having a high-throughput, low-latency pub-sub messaging engine like Kafka will help simplify how you handle data beyond that," Quraishi said.
The Kafka and Spark Streaming combo appeared in other conference presentations, as well. Applications ranged from a real-time fraud detection system created at Netherlands-based financial services company ING Group to a streaming system for handling sensor data transmitted from railroad cars that's in progress at industrial conglomerate Siemens AG, which is based in Munich, Germany.
According to Yvonne Quacken, a senior big data architect and engineer at Siemens, who also spoke at the Strata conference, the need for fast, easy and flexible data streams is high. "Today, we lose time just trying to load data," she said.
"In general, we're hoping to optimize our processes. What we're doing now will enable more in the future," said Quacken, whose team is working with data warehouse and big data vendor Teradata to connect Kafka, Spark and other open source software to handle incoming data from the Internet of Things.
What's driving the data train?
What's driving innovation in streaming data analytics today is the generally faster cadence of digital business overall, according to independent analyst and industry observer Thomas Dinsmore. That cadence is often driven by the growth of Web and mobile applications, Dinsmore said.
He noted that streaming analytics has a long lineage dating back to the development of complex event processing (CEP) technologies by Tibco Software and other vendors. But CEP systems were fairly expensive to implement, Dinsmore said, "so penetration was limited to the strongest use cases, like high-velocity trading and capital markets applications."
Dinsmore said the new generation of open source data streaming tools -- he cited Apex, Flink, Samza, Spark Streaming and Storm among notable frameworks -- "offers the potential to lower costs dramatically, which opens new use cases."
In the move to Hadoop-oriented data streaming, the pairing of Spark Streaming and Kafka has a head start. But if recent Hadoop history is a gauge, a number of data streaming frameworks are likely to vie for data managers' attention, as the pace of development on the technology quickens.