Optimize Kafka Client for Availability
To optimize for high availability, you need to tune your Kafka application to recover quickly from failure scenarios. This involves configuring parameters that control failure detection, recovery, and state restoration.
The configuration parameters discussed in this guide have varying ranges of values. The optimal settings depend on your chosen data streaming engine, specific requirements, and environmental factors such as average message size, number of partitions, and other system characteristics. Therefore, benchmarking is essential to validate and fine-tune the configuration for your particular application and environment.
Write Quorum and Acknowledgment Quorum Size
Note
The following configurations apply to the Classic Engine only. They do not apply to the Ursa Engine.
When a producer sets `acks=all` or `acks=-1`, two configuration parameters control message replication:
- `managedLedgerDefaultWriteQuorum`: Specifies the replication factor for storing messages (the number of replicas)
- `managedLedgerDefaultAckQuorum`: Specifies the minimum number of replicas that must acknowledge a write before it is considered successful
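A producer opts into this replication guarantee by setting `acks=all`. The Java sketch below shows a minimal producer with that setting; the bootstrap server, topic name, and record contents are placeholders rather than values specific to StreamNative Cloud:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AllAcksProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap server; replace with your cluster's Kafka endpoint.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Require the full acknowledgment quorum before a send is considered successful.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Block until the write is acknowledged (or an exception is raised).
            producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
        }
    }
}
```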
If the minimum acknowledgment quorum cannot be met, the producer raises an exception. To improve data availability:
- Increase the write quorum to maintain more replicas of the data
- Increase the difference between write quorum and acknowledgment quorum sizes to improve write availability while maintaining durability guarantees
These configurations can be set on a per-namespace or per-topic basis, depending on your requirements.
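As an illustration, one way to override these quorums at the namespace level is through the Pulsar admin API's persistence policies, which set the ensemble size, write quorum, and ack quorum for a namespace. The sketch below assumes the Apache Pulsar admin Java client; the service URL, token, namespace, and quorum sizes are placeholders, and your StreamNative Cloud setup may expose this configuration differently:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.AuthenticationFactory;
import org.apache.pulsar.common.policies.data.PersistencePolicies;

public class SetNamespaceQuorums {
    public static void main(String[] args) throws Exception {
        // Placeholder admin endpoint and token.
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("https://your-cluster.example.com")
                .authentication(AuthenticationFactory.token("YOUR_TOKEN"))
                .build();

        // ensemble=3, write quorum=3, ack quorum=2: three replicas are written,
        // but only two acknowledgments are required for a write to succeed.
        admin.namespaces().setPersistence("public/default",
                new PersistencePolicies(3, 3, 2, 0.0));

        admin.close();
    }
}
```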
Consumer Failures
Consumers in a consumer group can share processing load. If a consumer unexpectedly fails, StreamNative Cloud detects the failure and rebalances the partitions among the remaining consumers in the consumer group. Consumer failures can be either hard failures (for example, `SIGKILL`) or soft failures (for example, expired session timeouts). These failures are detected when consumers fail to send heartbeats or to call `poll()`.
Consumer liveness is maintained with a heartbeat (running in a background thread since KIP-62). The `session.timeout.ms` configuration parameter dictates the timeout used to detect failed heartbeats. You can increase the session timeout to account for potential network delays and avoid soft failures.
Soft failures most commonly occur in two cases:
- When `poll()` returns a batch of messages that take too long to process
- When a JVM GC pause takes too long
If you have a `poll()` loop that spends significant time processing messages, you can:
- Increase `max.poll.interval.ms` to allow more time between fetching records
- Reduce `max.poll.records` to decrease the batch size returned (both adjustments are shown in the sketch below)
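The following sketch combines both adjustments in a simple poll loop, raising `max.poll.interval.ms` and lowering `max.poll.records` so that slow per-record processing does not trigger a rebalance. The endpoint, group id, topic, and values are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SlowProcessingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder endpoint
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");             // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Give each poll() cycle up to 10 minutes before the consumer is considered stuck.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000");
        // Return fewer records per poll() so each batch can be processed within that window.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // stand-in for your (potentially slow) processing logic
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
```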
While higher session timeouts increase the time needed to detect and recover from consumer failures, failed client incidents are generally less frequent than network issues.
Summary
Here's a summary of key configurations for optimizing availability:
Consumer Configurations
| Configuration | Recommended Value (ms) | Default Value (ms) | Description |
|---|---|---|---|
| `session.timeout.ms` | 30000-60000 | 45000 | Time without heartbeats before the consumer is considered failed |