Kafka, Drilled down to its bits

Alpit Anand
5 min read · May 10, 2020

In my previous article, I explained Kafka producers: what they do, how they work, and the settings and configurations to tune for our specific requirements. In this part, we will be talking about brokers. If you remember, Kafka is broken down into 3 parts:

  1. Producer
  2. Broker
  3. Consumer

Let's talk about the broker.

What is a broker anyway?

In simple words, a broker is a Kafka server. Period. The broker receives messages from producers and serves them to consumers for further processing.

If you have read the producer article, you'll recall that I talked about partitions in a broker.

Partitions allow you to parallelize a topic by splitting the data in a particular topic across multiple brokers — each partition can be placed on a separate machine to allow multiple consumers to read from a topic in parallel.

Now let's see what that means: within a broker, we can divide a particular topic into multiple partitions, and the more partitions we have, the more parallelism we get.

Choosing the proper number of partitions for a topic is key to achieving a high degree of parallelism for both writes and reads, and to distributing load. An evenly distributed load over partitions is a key factor for good throughput (it avoids hot spots).

For example, if you want to be able to read 1 GB/sec, but your consumer is only able to process 50 MB/sec, then you need at least 20 partitions and 20 consumers in the consumer group.

A simple rule of thumb is #Partitions = max(Np, Nc), where Np is the number of partitions needed to meet the producer throughput target and Nc is the number needed to meet the consumer throughput target.
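To make the arithmetic concrete, here is a minimal sketch in plain Python (the 100 MB/sec per-partition producer throughput is an assumed figure, not from the article):

```python
# Rough partition-count estimate: the max of what producers and consumers need.
def partitions_needed(target_mb_per_sec, producer_mb_per_partition, consumer_mb_per_partition):
    np = -(-target_mb_per_sec // producer_mb_per_partition)   # ceiling division
    nc = -(-target_mb_per_sec // consumer_mb_per_partition)
    return max(np, nc)

# The example above: a 1 GB/sec target where each consumer handles only 50 MB/sec.
print(partitions_needed(1000, 100, 50))   # -> 20
```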

But there is much more to partitions. For example, every partition we create on a production system should have at least 3 replicas to be fail-safe, i.e. the replication factor should be 3.

The replication factor determines the number of replicas each partition has. This allows Kafka to automatically fail over to these replicas when a server in the cluster fails, so that messages remain available in case of failures.
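As a minimal sketch of how this looks in code, assuming the confluent-kafka Python client, a local cluster at localhost:9092, and a hypothetical topic named "payments":

```python
# Create a topic with 6 partitions and a replication factor of 3.
# Assumes the confluent-kafka Python client and brokers at localhost:9092.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
topic = NewTopic("payments", num_partitions=6, replication_factor=3)

for name, future in admin.create_topics([topic]).items():
    future.result()   # raises if the creation failed
    print(f"created topic {name}")
```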

Each partition has a leader server (the broker where the leader replica of the partition resides) and zero or more follower servers. The leader handles all read and write requests for its partition, while followers replicate the leader and take over if it dies. The leader of a partition is elected by a broker known as the controller, which is in turn elected via ZooKeeper. The controller detects failures at the broker level and is responsible for electing a new leader for all the affected partitions of a failed broker.
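To see the leader/follower layout for yourself, here is a small sketch (again assuming the confluent-kafka Python client and the hypothetical "payments" topic) that prints the leader, the replicas, and the in-sync replicas of each partition:

```python
# Print leader, replicas and in-sync replicas (ISR) for every partition.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
metadata = admin.list_topics(topic="payments", timeout=10)

for pid, p in sorted(metadata.topics["payments"].partitions.items()):
    print(f"partition {pid}: leader=broker {p.leader}, "
          f"replicas={p.replicas}, in-sync={p.isrs}")
```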

But all of this fail-safe mechanism comes at a cost:

  1. Broker disk size: the replication factor directly impacts the overall disk usage of each broker.
  2. Replication latency: with a large number of partitions to replicate, extra latency is added.

Refer to this image for the partition leader and follower system.

Here partitions 0 and 3 are the leaders on Broker 0, and so on.

Here we notice one thing: all the replicas of a partition are placed on different brokers. The reason is that if one broker fails, its partitions' replicas shouldn't fail along with it.

One more takeaway: the producer always sends data to the leader of a partition.

However, we should also specify the minimum number of in-sync replicas.

min.insync.replicas is the minimum number of copies of the data that must be online (in sync) at any time for the partition to keep accepting new incoming messages. It works together with the producer setting acks=all: a write is acknowledged only once at least that many replicas have it.
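A minimal sketch, assuming the confluent-kafka Python client and a hypothetical topic named "orders": the topic requires at least 2 in-sync replicas, and the producer asks for acks=all so writes wait for them.

```python
# Topic that needs 2 in-sync replicas, plus a producer that waits for them.
from confluent_kafka import Producer
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
topic = NewTopic("orders", num_partitions=6, replication_factor=3,
                 config={"min.insync.replicas": "2"})
for name, future in admin.create_topics([topic]).items():
    future.result()

# acks="all": the broker acknowledges a write only after all in-sync replicas
# (and there must be at least min.insync.replicas of them) have persisted it.
producer = Producer({"bootstrap.servers": "localhost:9092", "acks": "all"})
producer.produce("orders", value=b"order-42")
producer.flush()
```

If the number of in-sync replicas drops below 2, such writes are rejected rather than silently losing durability.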

Now let's tackle some edge cases over here.

What happens if the broker holding the leader partition fails while its followers are not in sync?

If the offline broker was a leader, a new leader is elected from the replicas that are in sync. If no replicas are in sync, an out-of-sync replica will be elected only if unclean.leader.election.enable is true; otherwise, the partition stays offline.

If the offline broker was a follower, it will be marked as out of sync by the leader.

When the broker restarts, it will try to get back in sync. Once it has, whether it stays a follower or becomes the leader depends on whether it is the preferred replica.

Finally, if you know a broker will be offline for a long time and still require a replica, you can use the reassignment tool kafka-reassign-partitions.sh to move partitions to online brokers.

What determines whether a replica is in sync?

There is a configuration known as replica.lag.time.max.ms, which determines the maximum time a replica can lag behind the leader and still be considered in sync. The default is 30 seconds in current versions (10 seconds in older ones), i.e. as long as a follower has fully caught up with the leader within that window, it is counted as an in-sync replica. We can tune this value according to our needs and requirements.

This pretty much covers partitions at a high level, but what happens inside a partition? How is the data stored?

Next we will take a look at all those questions.

Going a level deeper, we find that partitions are internally split into what are known as segments.

When Kafka writes to a partition, it writes to a segment: the active segment. If the segment's size limit is reached, a new segment is opened, that segment becomes the new active segment, and the write proceeds there.

The maximum size of a segment is roughly 1 GB by default (controlled by log.segment.bytes); after that, Kafka creates the next segment.
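As a rough mental model of segment rolling (a plain-Python sketch, not Kafka's actual implementation):

```python
# Toy model of a partition rolling segments once the active one is "full".
SEGMENT_BYTES = 1 * 1024 ** 3   # log.segment.bytes defaults to roughly 1 GB

class Partition:
    def __init__(self):
        self.segments = []        # list of (base_offset, records)
        self.active_size = 0      # bytes written to the active segment
        self.next_offset = 0

    def append(self, record: bytes):
        # Open a new active segment if none exists or the current one is full.
        if not self.segments or self.active_size + len(record) > SEGMENT_BYTES:
            self.segments.append((self.next_offset, []))   # named by base offset
            self.active_size = 0
        self.segments[-1][1].append(record)
        self.active_size += len(record)
        self.next_offset += 1
```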

Segments are named by their base offset; this image sums it up.

On disk, a partition is a directory and each segment is an index file and a log file.

Each segment has 2 files:

  1. filename.index
  2. filename.log

Logs are where the messages are stored, and the index stores the position of each message within the log file.

The segment index stores the position of each message in the log.

The index file is memory-mapped, and the offset lookup uses binary search to find the nearest offset less than or equal to the target offset.

Because each index entry stores both an offset and the physical position of that message in the log file, Kafka can jump almost directly to the right place in the log instead of scanning it from the beginning.
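As a tiny illustration (plain Python with made-up entries, not Kafka's actual on-disk format), this is roughly how a sparse offset index plus binary search finds where to start reading the log:

```python
# Sparse offset index: each entry maps an offset to a byte position in the .log file.
import bisect

index = [(0, 0), (57, 16384), (112, 32768), (170, 49152)]   # (offset, position)

def lookup(target_offset):
    """Return the nearest indexed entry with offset <= target_offset."""
    offsets = [off for off, _ in index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return index[i]   # then scan forward in the .log file from that position

print(lookup(100))    # -> (57, 16384): start reading the log at byte 16384
```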

With this, we come to the end of the article. The next article will be dedicated to consumers, and we will try to figure out what makes them tick.

Claps please :-) and I'd love your feedback.
