Kafka, Drilled down to its bits

In my previous article, I explained Kafka producers: what they do, how they work, and which settings and configurations to tune for specific requirements. In this part, we will be talking about brokers. If you remember, Kafka is broken down into three parts:

  1. Producer
  2. Broker
  3. Consumer

Let's talk about the broker.

What is a broker anyway?

In simple words, a broker is a Kafka server. Period. A Kafka broker receives messages from producers, stores them, and serves them to consumers for further processing.

If you have read the producer article, you will recall that I talked about partitions within a broker.

Now let's see what that means: within a broker, a topic can be divided into multiple partitions, and more partitions generally means more parallelism.

Choosing the proper number of partitions for a topic is key to achieving a high degree of parallelism for both writes and reads, and to distributing load. An evenly distributed load across partitions is a key factor for good throughput (it avoids hot spots).

A simple formula would be #Partitions = max(NP, NC), where NP is the number of producers and NC is the number of consumers.
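The rule of thumb above can be written as a one-liner. This is only a sketch of the article's heuristic; the producer and consumer counts here are illustrative, not measured throughput:

```python
def suggested_partitions(num_producers: int, num_consumers: int) -> int:
    """Rule of thumb from the article: #Partitions = max(NP, NC)."""
    return max(num_producers, num_consumers)

# e.g. 4 producers and 6 consumers in one group -> at least 6 partitions,
# so every consumer in the group can own at least one partition.
print(suggested_partitions(4, 6))  # -> 6
```

In practice you would also factor in the target throughput per partition, but this gives a sane lower bound.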

But there is much more to partitions. For example, every partition on a production system should have at least 3 replicas for fail-safety, i.e. the replication factor should be 3.

Each partition has a leader server (the broker where the leader partition resides) and zero or more follower servers. The leader handles all read and write requests for the partition; followers replicate the leader and take over if it dies. The leader of a partition is elected by a broker known as the controller, which is in turn elected via ZooKeeper. The controller detects failures at the broker level and is responsible for electing a new leader for all affected partitions on a failed broker.

But all this fail-safe mechanism comes at a cost:

  1. Broker disk size: the replication factor directly multiplies the overall disk usage on the brokers.
  2. Replication latency: with a large number of partitions to replicate, extra latency is added.

Refer to this image for the partition leader and follower system.

Here, partitions 0 and 3 have their leaders on Broker 0, and so on.

Notice one thing here: all replicas of a partition live on different brokers. The reason is simple: if a broker fails, the replicas of its partitions shouldn't fail with it.

One more takeaway is that the producer always sends data to the leader partition.

However, we should also specify the minimum number of in-sync replicas required for a write to succeed, via the min.insync.replicas setting.
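The interplay between the producer's acks=all setting and min.insync.replicas can be sketched as a simple check. This is a toy model of the acknowledgment rule, not actual broker code:

```python
def can_ack_write(in_sync_replicas: int, min_insync_replicas: int) -> bool:
    """With acks=all, the broker only acknowledges a produce request when
    the partition's current ISR count is at least min.insync.replicas;
    otherwise the producer receives a NotEnoughReplicas error and retries."""
    return in_sync_replicas >= min_insync_replicas

# Replication factor 3 with min.insync.replicas=2:
print(can_ack_write(3, 2))  # True  - all replicas healthy
print(can_ack_write(2, 2))  # True  - one replica down, still tolerable
print(can_ack_write(1, 2))  # False - writes rejected until the ISR recovers
```

This is why replication factor 3 with min.insync.replicas=2 is a common production pairing: it tolerates one broker failure without blocking writes.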

Now let's tackle some edge cases over here.

What will happen if the broker with the leader partition fails before its followers are fully in sync?

If the offline broker was a leader, a new leader is elected from the replicas that are in sync. If no replicas are in sync, an out-of-sync replica will be elected only if unclean.leader.election.enable is true; otherwise, the partition goes offline.

If the offline broker was a follower, it will be marked as out of sync by the leader.

When the broker restarts, it will try to get back in sync. Once it has, whether it stays a follower or becomes the leader depends on whether it is the preferred replica.

Finally, if you know a broker will be offline for a long time and you still require replicas, you can use the reassignment tool, kafka-reassign-partitions.sh, to move its partitions to online brokers.

What determines whether a replica is in sync?

There is a configuration known as replica.lag.time.max.ms, which determines the maximum lag an in-sync replica may have. The default is 30 seconds (10 seconds in Kafka versions before 2.5), i.e. even if a replica is lagging by up to that much, it is still considered in sync. We can adjust this value according to our needs and requirements.

This pretty much covers partitions at a superficial level. But what happens inside a partition? How is the data stored?

Next, we will take a look at those questions.

Going a level deeper, we find that partitions are internally divided into units known as segments.

When Kafka writes to a partition, it writes to a segment, the active segment. When the segment's size limit is reached, a new segment is opened, that one becomes the new active segment, and the write proceeds there.

The maximum size of a segment defaults to roughly 1 GB (controlled by log.segment.bytes); after that, Kafka rolls over and creates the next segment.

Segments are named by their base offset, this image will sum it up.
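The naming convention can be illustrated directly: the base offset (the offset of the first message in the segment) is zero-padded to 20 digits to form the file name. A sketch mirroring Kafka's on-disk convention:

```python
def segment_filename(base_offset: int, suffix: str) -> str:
    """Kafka names segment files by zero-padding the segment's base
    offset (offset of its first message) to 20 digits."""
    return f"{base_offset:020d}.{suffix}"

print(segment_filename(0, "log"))         # 00000000000000000000.log
print(segment_filename(368769, "index"))  # 00000000000000368769.index
```

This makes segments naturally sort in offset order in a directory listing, so Kafka can locate the right segment for an offset just by comparing file names.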

On disk, a partition is a directory and each segment is an index file and a log file.

Two files:

  1. filename.index
  2. filename.log

The log file is where the messages are stored, and the index stores the position of each message in the log.

The segment index stores the position of messages in the log.

The index file is memory-mapped, and the offset lookup uses binary search to find the nearest offset less than or equal to the target offset.
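The lookup can be sketched with Python's bisect module. This is a toy in-memory index; the real file stores relative offsets and physical positions in a compact binary format:

```python
import bisect

# Toy sparse index: (offset, byte position in the .log file), sorted by offset.
index = [(0, 0), (100, 4096), (200, 8192), (300, 12288)]

def find_position(target_offset: int) -> int:
    """Binary-search the index for the greatest entry whose offset is
    <= target_offset and return its byte position; Kafka then scans the
    log forward from there to reach the exact message."""
    offsets = [offset for offset, _ in index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return index[i][1]

print(find_position(250))  # 8192 -> start scanning from offset 200's position
```

Because the index is sparse, it stays small enough to memory-map, at the cost of a short sequential scan in the log after each lookup.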

With this, we come to the end of the article. The next article will be dedicated to consumers, and we will try to figure out what makes them tick.

SDE-II at Paypal. Contact for any queries alpitanand20@gmail.com