Kafka Architecture Explained
Preface: How does Kafka work?
Producers
Producer applications send messages to Kafka over a TCP network connection. These applications are written in Java (or another supported language) and send messages as JSON, Avro, or ProtoBuf format. These messages are serialized into byte streams and physically stored in files on Kafka broker servers.
Topics
Messages are categorized into different topics. While topics are often compared to "tables" in a relational database, they are really just logical groupings of partitions.
Partitions
Partitions are what makes the world go round in Kafka. Partitions collectively represent a given topic and are physically located on different brokers in the cluster.
When a producer sends a message to a given topic, it's written to a given partition for that topic. By default, Kafka writes to a topic's partitions in a round robin fashion. Alternatively, producers can specify a key to dictate which partition a message gets written to.
Partitions are replicated across Kafka brokers so that "copies" of partition data live on different machines. If one machine goes down, another can take its place. This is central to achieving fault tolerance in Kafka.
Commit logs
Partitions write messages as linear commit logs. These logs are immutable and record an offset as a sequential identifier for each record in the log.
A partition's commit log always lives on a single broker. This commit log is divided into segment files. When one segment file reaches a configured size, the next message is written to a new segment file.
Consumers
Consumer applications read messages from Kafka. Consumer applications exist in groups. A consumer group reads data from a topic. Each consumer is responsible for reading messages from a partition within the topic.
Consumers read messages from a partition in the same order they are written to the partition. A consumer keeps track of its "current index" via the offset value. If the consumer goes down, it can pick up where it left off via this offset value.
Retention
The data stored in Kafka is retained based on time or size constraints. For example, the default retention policy is 7 days for messages in a given topic.
Retention policies are configured per topic. Log compaction can also be enabled to essentially replace old messages with updated values.
Kafka Architecture Explained
The above description of how Kafka works is nothing new to most Kafka enthusiasts. Producers send messages to topics, topics are organized into partitions, and consumers read messages from these partitions.
While understanding the pub/sub concepts of Kafka is fairly straight forward, explaining the decisions behind Kafka architecture requires a deeper dive...
Producers explained
Producers were designed to be completely decoupled from consumers. This is key to the scalability of Kafka as a messaging platform.
Unlike traditional pub/sub implementations, producers don't need to concern themselves with consumer availability. Since the burden of reading messages is placed on consumer groups themselves, an infinite number of producers can be added without performance implications.
This is a significant improvement over alternative pub/sub implementations where message producers must coordinate more closely with consumer availability before publishing messages.
The decision to decouple producers from consumers also helps with guarantees on message delivery. Without having to worry about consumer availability, Kafka is able to guarantee exactly once delivery etc.
Producers are designed for high throughput and low latency. Producers are configured to accumulate data in memory so they can reduce network communication with brokers.
Producers are configured with time and size constraints to effectively "batch" messages together. By effectively batching messages sent by producers, Kafka is able to significantly reduce i/o and improve latency.
Producers are also designed to optionally trade latency for durability. Through the "ack" configuration, producers have control over guarantees with respect to replication and write acknowledgements.
Of course, the more durable the acknowledgement, the longer the acknowledgement takes. In this sense, producers can be configured to emphasize performance over completeness and visa versa.
Topics explained
Topics provide a logical grouping of data so that messages can be categorized like tables in a database.
Topics themselves don't really do anything. They exist to organize the data you work with in Kafka. Topics don't physically store data. Instead, partitions are responsible for writing to commit logs.
Partitions explained
Partitions are everything in Kafka. They are what allow client applications (publishers, consumers) to read/write to multiple brokers simultaneously. Partitions are also what makes replication possible in Kafka.
Without partitions, data for a particular topic would live entirely on a single broker. If that machine went down, data for the entire topic would be temporarily unavailable or lost.
Furthermore, performance would take a huge hit. A single broker would bare the responsibility of handling all the reads and writes for a topic. That's a lot of stress to put on a single machine, especially when dealing with big data use cases.
With partitions being distributed across the cluster, consumer groups can read data from a single topic in parallel. Likewise, producers can write messages in parallel. This is why partitions are so key to the benefits of using Kafka and why Kafka is so popular today.
Commit logs explained
The data stored in a given partition is represented by a commit log. The commit log represents a group of segmented files stored on disk. All of these files reside on the same physical broker (remember that individual partitions never span multiple servers).
A message is appended to the commit log via a linear disk write operation. There is a common misconception that disk operations are always slower than in memory operations. The truth is that modern OS leverage things like pagecache behind the scenes to make linear writes far faster than in memory operations.
Remember that Kafka is built on top of the JVM. The memory overhead of objects in Java is very high and garbage collection gets increasingly unstable as in-heap data increases. For these reasons, leveraging OS write operations is significantly faster than managing some in-memory cache.
By leveraging linear reads/writes on commit logs, Kafka is able to achieve extremely high throughput and low latency (Reads and writes are in constant time O(1)). Instead of writing as much data as possible to some memory buffer and flushing to the filesystem, Kafka immediately writes messages to the commit log.
This effectively offloads all the work to the OS and avoids the pitfalls of managing memory through Java. Additionally, linear reads/writes allow commit logs to store unlimited amounts of data without significant performance hits.
The segmentation of the commit log over a collection of files is also key to performance in Kafka. Each segment has a corresponding index file which stores the offset and keys of individual messages. Similar to indexing in a database, this makes lookups significantly faster for read operations on a given topic.
The ability to read messages in order (at the partition level) is a key advantage of using Kafka. Commit logs make this possible as they represent a linear stream of messages.
Consumers explained
Kafka implements a "pull" based approach to consuming messages. Consumer groups pull messages from a given topic. Consumer applications within the group pull messages from a specific partition within the topic. This allows consumers to read from a given topic in parallel.
With any message system, the objective is for consumers to consume messages as fast and efficiently as possible. The problem with alternative "push" based approach is that consumers can get overwhelmed with messages being delivered.
By implementing a "pull" based approach, consumers control the rate at which they consume messages. This avoids problems where production rates exceed consumption rates and allows for a more diverse group of consumers to read from Kafka.
This pull based approach also proves more efficient in batching messages. Kafka consumers request messages from a given offset allowing them to receive "batches" of data. While a push based approach can also batch messages, inefficiencies can emerge when latency is emphasized (consumer buffers messages even when a single message are pushed from producers).
Remember that messages consumed are never deleted. This allows multiple consumer applications to read the same messages. This is significantly different from other implementations where consumers compete to consume messages. Additionally, consumers can reread messages if errors occur or systems go down.
By organizing consumers into consumer groups, high throughput is achieved. This is because individual consumers are responsible for individual partitions. 5 consumers reading 5 partitions is a lot faster than 1 consumer reading 5 partitions.
This also points to the fact that simply adding consumers doesn't improve throughput in Kafka. Rather adding n consumers for n partitions is how you maximize the parallel consumption of data from a given topic.
Consumer offsets are stored in Kafka/Zookeeper. When consumers read messages, they commit which offsets they have handled back to Kafka brokers. If a consumer goes down, it can pick up where it left off when it comes back up. This addresses the problem of agreeing on what has been consumed between brokers and consumers.
In traditional messaging systems, brokers record what has been consumed by consumers. Sometimes they wait for an acknowledgement from the consumer to guarantee message delivery. Sometimes they don't wait for this acknowledgment to emphasize low latency.
Obviously there are trade-offs with both approaches. When brokers don't wait for acknowledgement, messages can get lost. When brokers wait for acknowledgment, they can end up sending messages twice (if the consumer consumes the message but fails on acknowledgement). Additionally, managing the state of whats been consumed can get expensive.
The offset solves this problem as single integers are all that is required to record where a particular consumer is in reading messages.
Kafka also designed consumers to optionally block polling for messages until more data arrives. This is a drastic improvement over alternative implementations where consumers continuously poll for messages (even when there aren't any). This polling strategy is obviously wasteful. Kafka consumers implement long polling by default and these settings can be tweaked via configurations like fetch.min.bytes and fetch.wait.max.bytes.
Retention explained
Retention policy is configured at the topic level and dictates how long messages are kept in commit logs. Commit logs retain their data for a configurable amount of time. By default, records are kept for 7 days before they are deleted from Kafka.
Size restraints can also factor into retention policies. A topic can be configured to purge records once the data reaches a certain size.
In addition to time and size constraints, log compaction can also be enabled at the topic level. Log compaction works by consolidating duplicate message keys with a single updated values. This can be useful when consumers only care about the most updated "version" of the messages they are consuming.
For example, say you have a Kafka topic responsible for recording user logins to a website. Each user would have their own partition via a userId key associated with each message produced. Messages would be produced every time the user logs in. Consumers may only care about the last login for a particular user. With log retention enabled, 10 login messages would be compacted to 1 message with the most recent login.
Log compaction is very useful for use cases where consumers only care about most updated values. Log compaction can significantly reduce the number of messages in a given topic and free up space.
Think about Uber drivers uploading their GPS coordinates. Consumers only care about the driver's current location. With millions of coordinates being produced every second, log compaction can work wonders for your cluster.
So how does log compaction work without jeopardizing performance? It turns out log compaction leverages those segment files used by commit logs to periodically recopy older files.
While the most recent segment file is actively being written to or consumed from, log compaction can simultaneously remove outdated values from older segments using its own pool of background threads.
Retention can also be indefinite. There are certainly use cases where Kafka is used as a database and this use case is growing in popularity.
Retention can be both time and size based AND utilize log compaction depending on the nature of the consumers and the use case.
Conclusion
Understanding how Kafka works is the first step to explaining architecture.
Producers write messages to topics and topics are logical groupings of partitions.
Partitions are commit logs consisting of segment files stored on disk.
Consumer groups read from topics in parallel with each consumer in the group reading from a single partition.
Kafka architecture achieves high throughput and low latency via decoupling of producers and consumers.
While Kafka utilizes memory buffers for producing and consuming messages, the actual data is accessed through linear reads and writes on commit logs.
Kafka also offers efficient batching formats to address network bandwidth bottlenecks.
Remember that the simplicity of the commit log and the immutability of its data are fundamental to Kafka's architecture and its resulting popularity today.