In-Depth Summary of Apache Kafka

Guidelines on how to achieve real-time data and stream processing at scale

Meet Kafka

How we move the data becomes nearly as important as the data itself. Publish/subscribe messaging is a pattern that is characterized by the senders (publishers) of a piece of data (message) and consumers of the message that are loosely decoupled from each other. Pub/sub systems like Kafka are often designed with the help of an intermediary broker to orchestrate this pattern to enable diverse use cases.

Image for post
Image for post
  • The consumer subscribes to one or more topics and reads the messages. The consumer keeps track of which messages it has already consumed by keeping track of the offset of messages. The offset is another bit of metadata — an integer value that continually increases — that Kafka adds to each message as it is produced. Each message in a given partition has a unique offset. By storing the offset of the last consumed message for each partition, a consumer can stop and restart without losing its place. Consumers work as part of a consumer group, which is one or more consumers that work together to consume a topic. The group assures that each partition is only consumed by one member. The mapping of a consumer to a partition is often called ownership of the partition by the consumer. In this way, consumers can horizontally scale to consume topics with a large number of messages. Additionally, if a single consumer fails, the remaining members of the group will rebalance the partitions being consumed to take over for the missing member. Kafka is also designed for multiple consumers to read any single stream of messages without interfering with each other.
Image for post
Image for post
Image for post
Image for post
Image for post
Image for post

Kafka Producers: Writing Messages to Kafka

Whether you use Kafka as a queue, message bus, or data storage platform, you will always use it with a producer that writes data, a consumer that reads data, or an application that serves both roles.

Image for post
Image for post
  • Synchronous send: We send a message, the send() method returns a Future object, and we use get() to wait on the future and see if the send() was successful or not.
  • Asynchronous send: We call the send() method with a callback function, which gets triggered when it receives a response from the Kafka broker that lets you to handle errors as well.

Configuring Producers

  • Configuring acknowledgements for the reliable delivery of messages is possible, or also best-effort (not reliable), which is not expecting acks, is providing high throughput.
  • Buffer memory size is important to preserve ensued message waiting for delivery, and it’s possible to configure a timeout for blocking on buffer’s availability.
  • By enabling compression, you reduce network utilization and storage.
  • Control how many times the producer will retry sending the message before giving up and notifying the client of an issue, and also backoff duration in between retries.
  • Set the batch size in bytes to a high number to avoid separate writes for the same partition, however, this does not mean that the producer will wait for the batch to become full.
  • The linger duration in ms controls the amount of time to wait for additional messages before sending the current batch (or sent immediately in default).
  • If the order of messages is not a concern, the throughput can be increased by increasing the number of in-flight requests per connection (how many messages the producer will send to the server without receiving responses), however the throughput will start decreasing once this number makes the batching less efficient.
  • The request timeout durations while waiting for a reply (that can cause retries) can be configured for both producers and brokers, and also the max blocking time for send calls not returning i.e. due to full buffer.
  • You can control the size of a request sent by the producer. It caps both the size of the largest message that can be sent and the number of messages that the producer can send in one request.
  • It is a good idea to increase the TCP send and receive buffers used by the sockets when producers or consumers communicate with brokers in a different datacenter because those network links typically have higher latency and lower bandwidth.
Image for post
Image for post

Kafka Consumers: Reading Data from Kafka

Kafka consumers subscribe to topics and receive messages for those subscribed topics. It is possible for one consumer to subscribe to multiple topics and also use a regular expression for the topic names. Each consumer is belonged to one consumer group if not created in the default one, and all consumers in the same consumer group split all available partitions of the topic.

Image for post
Image for post

Configuring Consumers

  • You can configure the consumer to fetch less times by increasing the min bytes to fetch that helps reducing the load on the brokers in case there are many consumers.
  • Yet if you still want to limit the potential latency, you can configure the max wait in ms to fetch new data.
  • It is possible to control the maximum number of bytes the server will return per partition to the consumer, but it must be larger than the largest message a broker will accept.
  • Since the consumer must call poll() frequently enough to avoid session timeout and subsequent rebalance, the size of the data received shouldn’t be too big to process before the session timeout. If more than the session timeout in ms passes without the consumer sending a heartbeat to the group coordinator, it is considered dead and the group coordinator will trigger a rebalance of the consumer group to allocate partitions from the dead consumer to the other consumers in the group. There should be a balance while picking this duration that will avoid taking too long to detect a failure and unwanted rebalances.
  • For the new consumer, it is possible to configure reading the newest records (records that were written after the consumer started running) or reading all the data in the partition starting from the very beginning.
  • You can control whether the consumer will commit offsets automatically, or when offsets are committed, which is necessary to minimize duplicates and avoid missing data.
  • The maximum number of records that a single call to poll() will return can be controlled to limit the amount of data your application will process in the polling loop.
  • The sizes of the low level TCP send and receive buffers used by the sockets when writing and reading data are also configurable. It can be a good idea to increase those when producers or consumers communicate with brokers in a different datacenter with high latency/low bandwidth.
  • RoundRobin: Takes all the partitions from all subscribed topics and assigns them to consumers sequentially, one by one. In general, if all consumers are subscribed to the same topics, RoundRobin assignment will end up with all consumers having the same number of partitions.

Managing Offsets

Kafka does not track acknowledgments from consumers the way many queues do, instead, it requires consumers committing the last offset read with a special topic to track their position in each partition. After a rebalance, each consumer may be assigned a new set of partitions. In order to know where to pick up the work, the consumer will read the latest committed offset of each partition and continue from there. Committing offsets for the messages not processed can cause losing those in case there is a failure, or processing messages and not committing can cause duplicate processing if the consumer fails.

Summary

This has described how brokers, producers and consumers work together and are configured within the Kafka cluster. Understanding the Kafka internals in-depth is also useful when tuning Kafka for Kafka practitioners i.e.:

  • How Kafka handles requests from producers and consumers
  • How Kafka handles storage such as file format and indexes

The summary (and illustrations) in this post is based on the book: “Kafka: The Definitive Guide

Written by

Software development engineer. Distributed systems enthusiast. #data, #iot, #mobile, #scalability, #cplusplus, #java https://github.com/aozturk

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store