
Kafka streaming gets a new twist

Startup vendor Confluent is looking to stake a claim in the big data ecosystem with Kafka streaming and management tools meant to reduce complexity in applications that put data in motion.

Closely watched startup Confluent Inc. last week released Confluent Enterprise 3.0, which includes a management system for Apache Kafka clusters. The software is intended to provide a better view into the active pipelines of moving data that are becoming more prevalent as users seek to quickly ingest data into operational systems meant to process and analyze information in near real time. 

The company also has embarked on its own style of Kafka streaming, announcing open source Confluent Platform 3.0, a package that includes Kafka Streams, a stream processor that could, in some applications, serve as a replacement for Apache Spark Streaming. The latter is the Hadoop ecosystem stream processor with which the Kafka messaging system is often coupled.

Both Confluent's management and streaming tools are intended to reduce complexity in applications focused on data in motion, which is a new area for many data developers.

The Kafka management enhancements are intended to help teams uncover performance bottlenecks without resorting to low-level program inspection.

Meanwhile, the Kafka streaming product could simplify streaming by keeping more of the process in the Kafka environment, although Confluent representatives emphasize that they anticipate coexistence with other streaming frameworks such as Spark and Flink.

"Spark is a really good system for writing a lot of applications. For human genome data, or for a year of advertising data, that's great. It works in batch and in distributed mode. But you need to run it in its own cluster," according to Confluent's Joseph Adler, director of product management and data science.

Adler said Kafka Streams, which can work without YARN, the Hadoop 2.0 resource manager, could find traction in situations where existing applications are being updated with additional data streams; he cited weather feeds and retail sales data as examples.
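That distinction matters because Kafka Streams runs as an ordinary library inside the application's own process, rather than as a job submitted to a dedicated cluster. As a rough illustration of that embedded model (a Python sketch of the concept only; Kafka Streams itself is a Java library, and the weather feed below is simulated), a stream transformation can be just a function the application calls over incoming records:

```python
# Toy illustration of the embedded stream-processing model: the
# "processor" is plain code running in the application's process,
# not a job dispatched to a separate processing cluster.

def process_weather_feed(records):
    """Filter and reshape a simulated weather feed, record by record."""
    for record in records:
        if record["temp_c"] is None:   # drop malformed readings
            continue
        # Emit a cleaned, converted record downstream.
        yield {"city": record["city"], "temp_f": record["temp_c"] * 9 / 5 + 32}

# Simulated feed standing in for a subscription to a Kafka topic.
feed = [
    {"city": "Oslo", "temp_c": 10},
    {"city": "Cairo", "temp_c": None},   # malformed reading
    {"city": "Lima", "temp_c": 20},
]

results = list(process_weather_feed(feed))
print(results)  # two cleaned records, converted to Fahrenheit
```

Because the processing logic lives in the application itself, adding such a stream to an existing system does not require standing up or operating a separate cluster, which is the deployment contrast Adler drew with Spark.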

Roots of Kafka

Confluent is very aware that a major use of the Kafka software is in stream processing. A company study released at its recent user conference showed that 68% of Kafka users surveyed plan to incorporate more stream processing over the next six to 12 months. Sponsored by Confluent, the survey covered more than 100 Kafka users from 20 countries. The results also mirrored Kafka's growing role in the enterprise, the company said, with 29% of respondents working for organizations with more than $1 billion in annual sales.

Kafka arose as part of web application efforts at LinkedIn. Kafka implements a log-oriented publish-and-subscribe messaging architecture, and is tuned specifically for distributed use. LinkedIn contributed Kafka to the Apache Software Foundation as an open source project in 2011. Confluent, which formed in late 2014 to commercialize Kafka tooling, is headed by individuals who played key roles in Kafka's creation at LinkedIn.
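The log-oriented model can be pictured as an append-only sequence of records in which each subscriber tracks its own read position, or offset. The following is a minimal, purely illustrative sketch of that idea (not Kafka's actual API):

```python
# Minimal sketch of a log-oriented publish-and-subscribe topic:
# producers append to an ordered log; each consumer keeps its own
# offset, so subscribers read independently at their own pace.

class TopicLog:
    def __init__(self):
        self._log = []        # append-only record log
        self._offsets = {}    # consumer name -> next offset to read

    def publish(self, record):
        self._log.append(record)

    def poll(self, consumer):
        """Return all records this consumer has not yet seen."""
        start = self._offsets.get(consumer, 0)
        records = self._log[start:]
        self._offsets[consumer] = len(self._log)
        return records

topic = TopicLog()
topic.publish("pageview:1")
topic.publish("pageview:2")

first = topic.poll("analytics")    # sees both records
topic.publish("pageview:3")
second = topic.poll("analytics")   # only the record published since
late = topic.poll("audit")         # a new subscriber reads from the start
```

Because the log itself is the source of truth and consumers merely advance offsets, many independent subscribers can read the same stream without interfering with one another, which is what makes the design a good fit for distributed use.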

While Confluent finds itself in an increasingly crowded Hadoop ecosystem, its close connection with the roots of Kafka may provide an early advantage. Kafka streaming, like Kafka management tools, seems a natural step for the young company.

On the inside track?

"Like a lot of vendors, Confluent is trying to simplify streaming data pipelines, but it has the distinct advantage of being the supporting and guiding company behind Apache Kafka," said Doug Henschen, vice president and principal analyst at Constellation Research Inc. in Cupertino, Calif.

Confluent's inside track may be tested as increasingly numerous vendors adopt open source Apache Kafka in their product lines. Recent weeks have seen Kafka included in a number of software rollouts.

For its part, SnapLogic updated its Elastic Integration Platform with support for Apache Kafka that includes pre-built data transformation operations and endpoint connectors, while Syncsort said its DMX-h v9 now integrates data from Kafka, as well as mainframe and relational databases, in a pipeline connected to Apache Hadoop and Spark.

Also last week, Kafka originator LinkedIn said it would open source the Kafka Monitor framework, an in-house effort that provides monitoring and testing of Kafka cluster deployments. These and other moves may suggest that, like predecessor Hadoop, Kafka could become a central point in a software ecosystem of its own.

Next Steps

Be there as Kafka maven Kreps discusses the way of the log

Check out HPE's take on Kafka messaging

Learn more about big data in motion
