As a senior member of a data team at LinkedIn earlier in this decade, Neha Narkhede played an important part in the creation of the Kafka distributed messaging system. Kafka became an open source Apache project, and it has come to include stream processing. It stands as one of the most successful data-processing frameworks to emerge in recent years.
With LinkedIn software architect Jay Kreps and others, Narkhede went on to form Confluent Inc., a Palo Alto, Calif., startup that is working to address the problem of hard-to-digest, fast-arriving big data with Kafka streaming. SearchDataManagement spoke with Narkhede, who is Confluent's CTO, as the company prepared to host next month's Kafka Summit in San Francisco. The event will cover Kafka streaming and a variety of other issues related to managing data in motion.
Can you walk us through some of the steps that led to Kafka and Kafka streaming?
Neha Narkhede: When I joined LinkedIn, we were entering our big growth phase, going from a couple million users to many millions of users. The situation we were facing was that the infrastructure was breaking down; it wasn't really scaling to the point the company needed.
Another thing to note about LinkedIn is that it is a very data-oriented company. Making use of our data in real time was a big trend. The problem we were facing was we had a lot of data coming in from our users that we wanted to process quickly, and then feed the information to all the downstream systems.
Just like other companies, we had Hadoop and all these distributed databases, and the problem was how to end up with a system that gives you a single-source-of-truth view, as well as to be able to process data in real time. And back then, there weren't very many solutions.
There were messaging systems that could do real time, but couldn't scale. There were ETL [extract, transform and load] tools that could do scale, but couldn't do real time. It was a complicated problem no one really wanted to touch. That's why we built Kafka. We felt it helped solve problems at LinkedIn and thought it might address problems that every company had.
What were some of the original principles behind Kafka?
Narkhede: If you look at Kafka, it is really very much like a large-scale distributed log, very similar to the back end of a traditional database, but also built as a serious, practical distributed system that is broadly applicable.
Before Kafka, the limitation was that you could only process information created by humans in real time, because the rate at which humans could create information, such as orders, sales or shipments, was far lower than the rate at which machines create information.
That related to a big change in the way companies wanted to operate. They wanted to become more digital -- collecting information from IoT [internet of things] devices, from machines doing IT monitoring and so on.
That information was at least an order of magnitude larger than anything created by humans. Handling big data that comes at you faster than you ever thought -- that is the type of data processing that Kafka is spearheading.
Basically, we started with people that had distributed systems experience and people who had database experience. It included people who had dabbled in the old-time stream-processing systems and knew their drawbacks. Very much from first principles, we asked what would be the core foundation of this system, what it would look like. As it turned out, a lot of it was about principles from the database world applied to distributed systems.
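The "distributed log" idea Narkhede describes can be illustrated with a toy, single-machine sketch. This is not Kafka's implementation or API; the names AppendOnlyLog and Reader are hypothetical and exist only for illustration. It assumes the core of the idea is that records are appended in order, each gets a numeric offset, and every reader tracks its own position, so a fast consumer and a slow one can share the same data.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class AppendOnlyLog:
    """Toy in-memory log: records are appended in order and never modified."""
    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Add a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[Any]:
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


@dataclass
class Reader:
    """Hypothetical consumer that remembers how far it has read."""
    log: AppendOnlyLog
    offset: int = 0

    def poll(self, max_records: int = 10) -> List[Any]:
        """Fetch the next batch and advance this reader's own offset."""
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = AppendOnlyLog()
for event in ("order-1", "sensor-42", "order-2"):
    log.append(event)

# Two independent readers over the same log, each with its own position.
realtime = Reader(log)
batch = Reader(log)

first = realtime.poll(max_records=2)   # reads the first two records
everything = batch.poll()              # reads all three in one pass
rest = realtime.poll()                 # picks up where it left off
```

Because reading does not remove anything from the log, adding another reader does not affect the existing ones. A real system would add partitioning, replication and durable storage, none of which is modeled here.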
Open source is a way to get developer mind share for Kafka. What happens next for someone building a company around that technology?
Narkhede: When Confluent was formed, the first two areas we looked into were stream processing and streaming data pipelines. With the Kafka Streams API, we invested in creating a stream-processing engine that is now part of native Apache Kafka. Meanwhile, streaming data pipelines are about being able to connect well-known systems to Kafka. So, we invested in the Kafka Connect API. If you look at Kafka now, it has evolved from just a messaging system into a full-fledged streaming platform.
As far as open source goes, developers decide to give Kafka a try. That is not difficult because it is open source. Then, when they go to production, they find use for things that are in our enterprise edition. They need management and monitoring. They need a user interface to visualize millions of messages going through the system. They need to be able to see how healthy their Kafka implementation is.
Another thing today is that every company has a wide footprint, whether it is across multiple geographically distributed data centers or a combination of on-prem and cloud implementations. There, Kafka is used as a real-time data bus, or a bridge to the cloud. So, the Confluent Enterprise edition has added Replicator, which enables you to do that bridging across data centers. We also have a subscription-based service that includes support and training, as well as a hosted offering for those that are entirely in the cloud.