> ## Documentation Index
> Fetch the complete documentation index at: https://docs.streamnative.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Optimize Kafka Client for Availability

To optimize for high availability, you need to tune your Kafka application to recover quickly from failure scenarios. This involves configuring parameters that control failure detection, recovery, and state restoration.

The configuration parameters discussed in this guide have varying ranges of values. The optimal settings depend on your chosen [data streaming engine](/cloud/overview/data-streaming-engine), specific requirements, and environmental factors such as average message size, number of partitions, and other system characteristics. Therefore, [benchmarking](/cloud/build/kafka-clients/optimize-and-tune/optimize-kafka-clients#benchmarking) is essential to validate and fine-tune the configuration for your particular application and environment.

## Write Quorum and Acknowledgment Quorum Size

<Note title="Note">
  The following configurations apply to the **Classic Engine** only. They do not apply to the **Ursa Engine**.
</Note>

When a producer sets `acks=all` or `acks=-1`, two configuration parameters control message replication:

* `managedLedgerDefaultWriteQuorum`: Specifies the replication factor for storing messages (number of replicas)
* `managedLedgerDefaultAckQuorum`: Specifies the minimum number of replicas that must acknowledge a write before it is considered successful

If the minimum acknowledgment quorum cannot be met, the producer raises an exception. To improve data availability:

* Increase the write quorum to maintain more replicas of the data
* Increase the difference between write quorum and acknowledgment quorum sizes to improve write availability while maintaining durability guarantees

These configurations can be set on a per-namespace or per-topic basis based on your specific requirements.

## Consumer failures

Consumers in a consumer group can share processing load. If a consumer unexpectedly fails, StreamNative Cloud detects the failure and rebalances the partitions amongst the remaining consumers in the consumer group. Consumer failures can be either hard failures (for example, `SIGKILL`) or soft failures (for example, **expired session timeouts**). These failures are detected when consumers fail to send heartbeats or `poll()` calls.

Consumer liveness is maintained with a heartbeat (running in a background thread since [KIP-62](https://cwiki.apache.org/confluence/display/KAFKA/KIP-62%3A+Allow+consumer+to+send+heartbeats+from+a+background+thread)). The `session.timeout.ms` configuration parameter dictates the timeout used to detect failed heartbeats. You can increase the session timeout to account for potential network delays and avoid soft failures.

Soft failures most commonly occur in two cases:

* When `poll()` returns a batch of messages that take too long to process
* When a JVM GC pause takes too long

If you have a `poll()` loop that spends significant time processing messages, you can:

* Increase `max.poll.interval.ms` to allow more time between fetching records
* Reduce `max.poll.records` to decrease the batch size returned

While higher session timeouts increase the time needed to detect and recover from consumer failures, failed client incidents are generally less frequent than network issues.

## Summary

Here's a summary of key configurations for optimizing availability:

### Consumer Configurations

| Configuration        | Recommended Value | Default Value | Description                               |
| -------------------- | ----------------- | ------------- | ----------------------------------------- |
| `session.timeout.ms` | 30000-60000       | 45000         | Time before consumer is considered failed |
