
Confluent's Kafka data-streaming framework gets 'SQL-ized'

SQL on Hadoop arrived -- so did SQL on Spark. Now, SQL on Kafka is emerging to provide a different way to look at Kafka data as it streams through the enterprise.

The day when armies of business analysts can query incoming data in real time may be drawing closer. Supporting such continuous interactive queries is a goal of KSQL, software put forward this week by the Kafka data-streaming software originators at Confluent Inc.

KSQL is a SQL engine that directly handles Apache Kafka data streams. As such, it can skip the other big data components that bring broadly supported SQL capabilities to Hadoop and Spark, but that may require intermediate data stores and batch-oriented processing.

The software is intended to bridge the gap between real-time monitoring and real-time analytics, according to Neha Narkhede, co-founder and CTO at Confluent, based in Palo Alto, Calif. At the Kafka Summit in San Francisco, she said KSQL can continuously join streaming data, such as web user clicks, with relevant table-based data, such as account information.
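To make the idea of a continuous stream-table join concrete, here is a minimal sketch in plain Python. It is not KSQL syntax and does not use any Kafka API; the `accounts` table, the `enrich_clicks` function and the field names are all hypothetical, chosen only to illustrate how each incoming click could be matched against account data as it arrives.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical account "table": the latest known row per user ID.
accounts: Dict[str, dict] = {
    "u1": {"name": "Ada", "plan": "pro"},
    "u2": {"name": "Lin", "plan": "free"},
}


def enrich_clicks(clicks: Iterable[dict], table: Dict[str, dict]) -> Iterator[dict]:
    """Join each click with the table row for its user, one event at a time."""
    for click in clicks:
        row = table.get(click["user_id"])
        if row is None:
            # Inner-join behavior (an assumption of this sketch):
            # clicks without a matching account are dropped.
            continue
        yield {**click, **row}


# A short list stands in for an unbounded stream of click events.
clicks = [
    {"user_id": "u1", "page": "/pricing"},
    {"user_id": "u3", "page": "/home"},
    {"user_id": "u2", "page": "/docs"},
]

enriched = list(enrich_clicks(clicks, accounts))
for event in enriched:
    print(event)
```

Because the function is a generator, it emits each enriched event as soon as the corresponding click is read, rather than waiting for a batch to complete. That is the behavior the continuous-query model is meant to provide, here reduced to a single process and an in-memory table.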

She also said KSQL is intended to broaden the use of Kafka beyond Java and Python, opening up Kafka programming to developers familiar with SQL. The form of SQL Confluent is using here is a dialect, however, one the company has developed to deal with the unique architecture of Kafka streaming. The software is appearing first as a developer preview, and it will be available under an Apache 2.0 license, according to the company.

Kafka data on the move

Created at LinkedIn, Kafka began life as a publish-and-subscribe messaging system that focused on handling log files as system events. It became an Apache Software Foundation project, and it was expanded to support a fuller data-streaming architecture.

The open source version of Kafka is commonly used in Hadoop and Spark data pipelines today. That puts it at the center of much of the industry activity aimed at putting big data into motion.

"Overall, we are seeing Kafka growing across large enterprises, in startups and in job posts. Companies are looking for people with Kafka skill sets," said Fintan Ryan, an analyst at Portland, Maine-based RedMonk. "Underlying that is a drive in use of streaming data in general."

Much of the current streaming data landscape is centered on Kafka. But a grab bag of alternatives exists.

Alternatives are found in long-standing software, such as RabbitMQ and Tibco StreamBase; in later entries, such as Amazon Kinesis and Apache Spark Streaming; and in newly emerging frameworks, such as Apache Flink, Microsoft Azure Event Grid and others. Just this month, startup Streamlio emerged from stealth mode, describing its efforts to promote enterprise streaming based on Heron -- a stream processor emerging from distributed systems work at another social media mainstay, Twitter.

The goal of the newly released KSQL is to bring Kafka streaming programming directly to SQL-capable developers. For example, it's meant to join click streams via continuous queries with table data.

Waiting for the SQL

For now, Confluent's KSQL is programmed via a command-line interface, Ryan noted. That means opportunity, he said, for other software vendors to build drag-and-drop interfaces that tap into Kafka via KSQL. In fact, at the Kafka Summit, analytics software provider Arcadia Data said it was working with Confluent to support a visual interface for interactive queries on Kafka topics, or Kafka message containers, via KSQL.

Confluent's KSQL scheme faces competition from a handful of players that have already been working to connect Kafka with SQL. Some of those players were on hand at the Kafka Summit with product updates.

At the conference, Striim Inc. rolled out its 3.7.4 platform release, adding more monitoring metrics and Kafka diagnostic utilities, as well as new connectors to Amazon Web Services Redshift and Simple Storage Service, Google Cloud, Azure SQL Server, Azure Storage and Azure HDInsight. Also at the summit, SQLstream launched Blaze 5.2, supporting Apache SystemML for declarative programming of machine learning applications.

Kafka SQL links and other streaming activity should not obscure the fact that big data streaming architecture is still a young discipline. That is emphasized by recent word that Apache Kafka would formally reach version 1.0.0 in early October.

Software veterans recall a time when development managers would wait until release 2.0 before touching any software, and release 1.0 was a nonstarter. But it seems the speed of data streaming today is such that this kind of caution has gone out the window, at least where organizations sense the potential for significant business advantage.
