- Fotolia

Apache Kafka committer Jay Kreps on the way of the log

To keep big data in motion, LinkedIn developers devised Apache Kafka, a publish-and-subscribe message dispatcher based on the concept of logs.

For a development team at LinkedIn Corp., dealing with big data in motion was a major challenge in shaping some of the defining applications of the modern Web. In the process, the team helped create the Apache Kafka messaging system, which is especially built to move big data.

Like others among a generation of Web app developers, Jay Kreps and his LinkedIn colleagues decided to build their own software to exploit new data infrastructures made possible by massively parallel commodity clusters, and to do so using open sourcing. At LinkedIn, such development supports a host of innovative applications.

"All data management software is getting rewritten to deal with large volumes of data," according to Kreps, who is a principal staff engineer at LinkedIn and an Apache Foundation project committer for the Apache Kafka messaging system. He added that server hardware advances have led to "an almost unlimited pool of machines" that lets you deal with data differently.

LinkedIn has fielded such near-real-time applications as "People You May Know" and "Who Viewed My Profile." In these and other endeavors, developers placed great emphasis on quickly capturing and processing streams of activity from different sources for different applications.

Publish-and-subscribe architecture

Jay Kreps, LinkedInJay Kreps, LinkedIn

While some reporting jobs could run in batch, a big push at LinkedIn was on "operationalizing" activity. Kreps and his colleagues turned to a publish-and-subscribe approach to writing messages. Since it involved "writes," it took on the name of 20th-century existential writer Franz Kafka.

The publish-and-subscribe architecture has roots in some sophisticated middleware used predominantly in high-end financial applications. Publish-and-subscribe is a fairly recent addition to open source settings. Kafka is meant to allow a single, elastic cluster to serve as the central data platform for an organization.

For his part, Kreps positions Kafka between online transaction processing (OLTP) systems that work immediately on data and Hadoop systems that -- until recently -- worked mostly in batch mode. He said many Kafka users are combining it with the Hadoop Distributed File System (HDFS), but HDFS it is not required to run Kafka.

At LinkedIn, Kreps wants to apply Kafka as a central multi-subscriber event log. He described logs as records that are appended as changes occur. With Kafka, machines are set up with subscriptions to specific logs, and they are alerted when updates to these logs are published.

Thoughts on the log metaphor

Krebs sees the log as a useful abstraction for architects working in an enterprise with data in motion. He said the concept shouldn't be unfamiliar, because the log concept is seen in databases and other distributed system implementations. The recipe he sees is to take all the organization's data and put it into a central log for real-time subscriptions.

"The log is about the right way of thinking about data flows in your organization," Krebs said, noting that the log can act as an abstraction for Kafka developers -- much as a file can act as a useful abstraction for Hadoop developers.

Working with the log concept can provide the basis for simplifying approaches to complex systems.

"The abstraction that Kafka provides is a log," he said. "Imagine a file [that] was continuously growing, and it had some kind of notion of position that enabled people subscribing to changes to see it. That is what a log is."

Since it isn't a database, log file collector or traditional messaging system, Krebs admitted Kafka is in a bit of a rarefied atmosphere. But he noted that recent work at Amazon on Kinesis, a piping system for connecting many diverse, distributed data systems, in ways resembles Kafka and its log abstraction. More of the same may follow as data in motion becomes a pressing concern in more enterprise applications.

What is next for Kafka are improvements in security and ''up-stack'' additions, such as add-on support for Samsa stream processing and Storm event processing frameworks, according to Kreps.

A goal of all this infrastructure technology is better data-driven decisions, he said.

"Organizationally, a lot of things that were left to guess work before are now a matter of measurement," he said. Such thinking is behind LinkedIn's customer-facing applications, which continue to be closely watched by other would-be digital enterprises.

Next Steps

Read more about big data in motion

Dig Deeper on Big data management