When you should be using Kafka
Curious how Kafka compares to the alternatives? Check out JMS vs Kafka.
Kafka is a hot technology packed with buzzwords: distributed, fault-tolerant, real-time, event-based, scalable messaging. Uber. Netflix. What's not to love?
While it's true that Kafka has established itself as a rock-solid foundation for scalable microservice architectures, it may be overkill depending on what you're trying to do with your applications.
Let's take a closer look at the key advantages of Kafka to see if it really is appropriate for your system's needs.
Preface: What is Kafka?
Kafka was originally developed by LinkedIn. It's now open source and used by big companies like Uber, Netflix, and PayPal to manage real-time event streaming.
Kafka is a distributed messaging system. It's similar to RabbitMQ, JMS, and other event-based messaging systems, where publishers raise events that are read by consumers.
How does Kafka work?
Applications called producers send messages to topics. These topics are like tables in a traditional database, but their data is partitioned across a cluster of machines. Because a topic's partitions can span multiple servers, the amount of data you can store in a given "table" isn't limited by any single machine. These partitions are also replicated to prevent data loss and increase fault tolerance.
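To make the producer side concrete, here's a minimal sketch using Kafka's Java client. The broker address, topic name, key, and value are all made-up placeholders for illustration:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-123") determines which partition the record lands on,
            // so events for the same user stay in order within one partition.
            producer.send(new ProducerRecord<>("user-events", "user-123", "page_view"));
        }
    }
}
```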
Groups of consumer applications then read from these topics in an optimal way (maximizing throughput, minimizing latency, and achieving fault tolerance). By allowing multiple consumers to read from the same event log, Kafka can scale to any number of applications reading the same data independently.
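A matching consumer sketch might look like this. The group.id ("analytics-service") is a made-up example; all consumers that share it split the topic's partitions between themselves:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ActivityConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics-service"); // consumers sharing this id divide the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```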
Further details on Kafka magic, how it works, and how to get started can be found in our Apache Kafka tutorial.
The Key Advantage of Kafka
Kafka shines as an event-based messaging system. It gives you the scalability of a more traditional queue-based system and the flexibility of a multi-subscriber pub/sub system...
Prior to Kafka, you couldn't get both. While a traditional queue allows you to divide processing work across multiple consumers, you can't deliver the same event to multiple consumers. This is because queues leverage "competing consumers" to efficiently read messages. When consumers compete to read shared information, you can't guarantee the order of messages or read the same data twice.
Conversely, with pub/sub you can deliver the same messages to multiple consumers, but you sacrifice scalable processing. This is because the message broker has to send every message to every subscriber, so each subscriber you add increases the work the broker does for every event it raises.
Scalability and Multiple Subscribers | The Best of Both Worlds
Kafka represents the best of both worlds. With Kafka's consumer group model, processing is efficiently distributed across a collection of consumer applications reading from a given topic. These same consumer groups are also what allow Kafka to broadcast messages to multiple consumers: any number of consumer groups can independently read from the same topic (remember that Kafka retains messages even after they are read).
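Here's a small sketch of that idea with made-up group names: two consumer groups subscribe to the same topic and each independently receives the full stream.

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class TwoGroupsDemo {
    // Each group.id gets its own committed offsets, so every group
    // independently receives the full stream of events from the topic.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", groupId);
        props.put("auto.offset.reset", "earliest"); // new groups start from the beginning of the log
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("user-events"));
        return consumer;
    }

    public static void main(String[] args) {
        try (KafkaConsumer<String, String> billing = consumerFor("billing-service");
             KafkaConsumer<String, String> analytics = consumerFor("analytics-service")) {
            // Both groups read the same topic; neither "uses up" the messages for the other.
            System.out.println("billing saw   " + billing.poll(Duration.ofSeconds(5)).count() + " records");
            System.out.println("analytics saw " + analytics.poll(Duration.ofSeconds(5)).count() + " records");
        }
    }
}
```

Because each group tracks its own offsets, adding a new downstream service is mostly a matter of picking a new group.id.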
Kafka vs RabbitMQ
RabbitMQ is one of the most popular alternatives to Kafka. It provides a more "push-based" approach to messaging: its message brokers "push" messages to consumers, whereas Kafka consumers "pull" messages from a topic.
Push vs Pull
With the push approach, consumers compete to read messages individually and quickly. The broker says "hey I have a message" and the first consumer that can consume the message takes it.
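As a rough sketch of what that looks like (assuming RabbitMQ's Java client, a local broker, and a made-up queue name), a push-based competing consumer registers a callback and the broker delivers each message to one of the waiting workers:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import java.nio.charset.StandardCharsets;

public class PushWorker {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed local RabbitMQ broker
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();
        channel.queueDeclare("task-queue", true, false, false, null);

        // The broker pushes each message to exactly one of the competing workers;
        // once acknowledged, the message is gone from the queue.
        DeliverCallback onDeliver = (consumerTag, delivery) ->
                System.out.println("got: " + new String(delivery.getBody(), StandardCharsets.UTF_8));
        channel.basicConsume("task-queue", true, onDeliver, consumerTag -> { });
    }
}
```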
With the pull approach, consumers "pull" or read messages from a given topic. Rather than consuming a message (leaving it unavailable for other consumers to read), consumers read messages and commit offsets to basically bookmark where they are in the log stream. This lets any number of other consumer groups consume the same information in parallel.
With Kafka, consumers can pull data based on their current state and capabilities. With RabbitMQ, consumers are forced to take messages regardless of their state.
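Here's what the pull side can look like in Kafka's Java client with auto-commit turned off, so the consumer bookmarks its own position. As before, the broker address, topic, and group name are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class PullWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "billing-service");
        props.put("enable.auto.commit", "false"); // we'll bookmark our position ourselves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-events"));
            while (true) {
                // Pull whatever this consumer is ready to handle right now.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }
                // Committing the offset is a bookmark in the log, not a deletion:
                // the records stay in the topic for any other consumer group to read.
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.println("handled offset " + record.offset());
    }
}
```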
While you can use Kafka for basic messaging, you can't use RabbitMQ for Kafka-specific things like real-time stream processing.
Kafka Use Cases:
The official documentation lists the following as good use cases...
- messaging
- website activity tracking
- metrics
- log aggregation
- stream processing
- event sourcing
- commit log
Can Kafka Be Used as a Database?
YES. Kafka can be used to store data. Remember that Kafka is a persistent, replicated log of messages. By using log compaction and retention policies, you can store the data you collect in Kafka forever if you want...
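For example, here's a rough sketch (using Kafka's Java Admin client, with made-up topic settings) of creating a compacted topic that keeps the latest value for every key indefinitely:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (Admin admin = Admin.create(props)) {
            // cleanup.policy=compact keeps the latest record for every key indefinitely,
            // which is what makes "Kafka as a store of current state" possible.
            // (For keeping the full history instead, retention.ms=-1 disables time-based deletion.)
            NewTopic userProfiles = new NewTopic("user-profiles", 6, (short) 3) // 3 replicas need >= 3 brokers
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(Collections.singletonList(userProfiles)).all().get();
        }
    }
}
```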
One of the main reasons developers are hesitant to utilize Kafka as a storage engine is the misconception that Kafka is architected like more traditional messaging systems (such as RabbitMQ).
Remember data you store in Kafka is persisted to disk and replicated for fault tolerance. Jay Kreps does an amazing job of explaining this in his article It's Okay To Store Data In Apache Kafka.
Is Kafka Overkill?
Kafka can be overkill if your applications won't benefit from real-time streaming of millions of records.
When should I use Kafka?
Use Kafka if you have a bunch of microservices that need to communicate at scale or if you need to stream process large amounts of data in real time.
Don't use Kafka if you have a monolithic application that doesn't need to communicate excessively with other services or won't be handling large amounts of data in real time.
Kafka makes sense when you are working with millions of events a minute and you want to process those events in real time.