
Optimize Kafka Client for Durability

Durability refers to ensuring messages are not lost during transmission and storage. StreamNative Cloud provides durability through its storage layer in two ways:

  • For Classic Engine clusters, messages are replicated across multiple storage nodes to protect against data loss
  • For Ursa Engine clusters, messages are persisted to object storage

The configuration parameters discussed in this guide have varying ranges of values. The optimal settings depend on your chosen data streaming engine, specific requirements, and environmental factors such as average message size, number of partitions, and other system characteristics. Therefore, benchmarking is essential to validate and fine-tune the configuration for your particular application and environment.

Producer Acknowledgments

Producers can control the durability of messages written to Kafka through the acks configuration parameter. Although you can use the acks parameter for throughput and latency optimization, it is primarily used to ensure message durability.

To optimize for high durability, set acks=all (equivalent to acks=-1). With this setting:

  • For Classic Engine clusters, the broker waits for acknowledgment from all storage nodes before responding to the producer
  • For Ursa Engine clusters, the broker waits for acknowledgment from object storage before responding to the producer

This provides the strongest available guarantees that messages won't be lost. The trade-off is higher latency since the broker must wait for all acknowledgments before responding to the producer.
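For illustration, here is a minimal Java producer sketch with acks=all; the bootstrap address and topic name are placeholders for your own values:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker.example.com:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait for acknowledgment from all storage nodes (Classic Engine) or
        // from object storage (Ursa Engine) before a send is considered successful.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));
        }
    }
}
```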

Duplication and Ordering

Producers can increase durability by retrying failed message sends to prevent data loss. The producer automatically retries sending messages up to the number specified by the retries parameter (default MAX_INT) and up to the time duration specified by delivery.timeout.ms (default 120000ms). You can tune delivery.timeout.ms to set an upper bound on the total time between sending a message and receiving an acknowledgment from the broker, which should align with your business requirements for message validity.
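For example, reusing the props object from the sketch above, you can make these settings explicit; the values shown are the client defaults, and delivery.timeout.ms should be tuned to your own requirements:

```java
// Both values below are the Kafka client defaults, shown explicitly.
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);   // default: MAX_INT
// Upper bound on the total time between send() and acknowledgment.
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000);  // default: 120000 ms
```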

There are two key considerations with automatic producer retries:

  1. Duplication: Transient failures in StreamNative Cloud may cause the producer to send duplicate messages when retrying.
  2. Ordering: Multiple sends may be "in flight" simultaneously, and a retry of a failed message may occur after a newer message has succeeded.

To address both concerns, configure the producer for idempotency by setting enable.idempotence=true. With idempotency enabled, brokers track messages using incrementing sequence numbers (similar to TCP). This prevents message duplication because brokers ignore duplicate sequence numbers, and it preserves message ordering because, on failure, the producer temporarily restricts itself to a single in-flight request until sequencing is restored. If idempotency guarantees cannot be satisfied, the producer raises a fatal error and rejects further sends, so applications should catch and handle these fatal errors appropriately.
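As a sketch, the following idempotent producer detects OutOfOrderSequenceException (one of the fatal errors the Java client raises when sequencing guarantees are violated) and flags the application to recreate the producer; the bootstrap address and topic name are placeholders:

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.OutOfOrderSequenceException;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker.example.com:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Idempotency requires acks=all and automatic retries (both defaults).
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        AtomicBoolean fatal = new AtomicBoolean(false);
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception instanceof OutOfOrderSequenceException) {
                            // Fatal: sequencing guarantees were violated and the
                            // producer rejects further sends. Flag the application
                            // to close this producer and create a fresh instance.
                            fatal.set(true);
                        }
                    });
            producer.flush();
        }
        if (fatal.get()) {
            // Recreate the producer (or fail fast) per your recovery policy.
        }
    }
}
```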

If you don't configure producer idempotency but require these guarantees, you must handle potential duplication and ordering issues differently:

  • For duplication: Build consumer application logic to handle duplicate messages
  • For ordering: Either:
    • Set max.in.flight.requests.per.connection=1 to allow only one request at a time
    • Set retries=0 to preserve order while allowing pipelining (accepting potential message loss)

Instead of automatic retries, you can handle retries manually by coding exception handlers in the producer client (for example, using the onCompletion() method of the Java client's Callback interface). For manual retry handling, disable automatic retries with retries=0. Note that producer idempotency only works with automatic retries enabled: manual retries generate new sequence numbers that bypass duplication detection. While disabling automatic retries may create message gaps from individual failures, the broker still preserves the order of received writes.
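A minimal sketch of this pattern, assuming a producer already configured with retries=0 and a hypothetical sendWithRetry() helper that retries only transient failures:

```java
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.errors.RetriableException;

public class ManualRetrySender {
    private final KafkaProducer<String, String> producer; // configured with retries=0

    public ManualRetrySender(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void sendWithRetry(ProducerRecord<String, String> record, int attemptsLeft) {
        producer.send(record, new Callback() {
            @Override
            public void onCompletion(RecordMetadata metadata, Exception exception) {
                if (exception instanceof RetriableException && attemptsLeft > 0) {
                    // Transient failure: resend manually. The resend gets a new
                    // sequence number, so idempotent de-duplication does not apply.
                    sendWithRetry(record, attemptsLeft - 1);
                } else if (exception != null) {
                    // Non-retriable failure or attempts exhausted: surface it.
                    exception.printStackTrace();
                }
            }
        });
    }
}
```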

Consumer Offsets and Auto Commit

When optimizing for durability, you need to carefully consider how consumer offsets are managed, especially in the case of unexpected consumer failures. Consumer offsets track which messages have been consumed, making the timing and method of offset commits crucial for durability. A key scenario to avoid is when a consumer commits an offset for a message, begins processing that message, but then fails unexpectedly. In this case, when a new consumer takes over the partition, it won't reprocess any messages with offsets that were already committed, potentially leading to data loss.

By default, consumer offsets are automatically committed during the consumer's poll() call at regular intervals. While this default behavior works well for many use cases, you may need stronger guarantees if your consumer is part of a transactional chain. In such cases, you might want to ensure offsets are only committed after messages are fully processed.

You can control whether offset commits happen automatically or manually using the enable.auto.commit parameter:

  • With enable.auto.commit=true (default), offsets are committed automatically during polling
  • With enable.auto.commit=false, you must explicitly commit offsets in your consumer code using either:
    • commitSync() for synchronous commits
    • commitAsync() for asynchronous commits

For maximum durability, consider disabling automatic commits and explicitly managing offset commits in your application code after successful message processing.
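For example, a durable consumer loop might look like the following sketch; the bootstrap address, group ID, and topic name are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DurableConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker.example.com:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "durable-group");                    // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Commit offsets manually, only after processing succeeds.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }
                // If the consumer crashes before this commit, the records above
                // are redelivered to the next consumer rather than lost.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%s:%d -> %s%n", record.topic(), record.offset(), record.value());
    }
}
```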

Exactly Once Semantics (EOS)

Note

EOS transactions are supported for Classic Engine clusters only.

For the strongest message delivery guarantees, you can configure your applications to use Exactly Once Semantics (EOS) transactions. EOS transactions enable atomic writes across multiple Kafka topics and partitions.

Since messages in the log may be in various states of a transaction, consumers can control which messages they receive using the isolation.level configuration parameter. Setting isolation.level=read_committed ensures consumers receive only:

  • Non-transactional messages
  • Committed transactional messages

Messages from open or aborted transactions are never delivered.

To implement transactional semantics in a consume-process-produce pattern and ensure exactly-once processing:

  1. Set enable.auto.commit=false on the consumer
  2. Manually commit offsets using the sendOffsetsToTransaction() method in the KafkaProducer API
  3. For event streaming applications built with Kafka Streams, set the processing.guarantee parameter (for example, exactly_once_v2) to enable exactly-once processing
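A minimal consume-process-produce sketch combining steps 1 and 2, assuming placeholder topic names, bootstrap address, and transactional ID; the consumer also uses isolation.level=read_committed so it never reads uncommitted output:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceLoop {
    public static void main(String[] args) {
        Properties cprops = new Properties();
        cprops.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker.example.com:9092"); // placeholder
        cprops.put(ConsumerConfig.GROUP_ID_CONFIG, "eos-group");                        // placeholder
        cprops.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cprops.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cprops.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);          // step 1
        cprops.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        Properties pprops = new Properties();
        pprops.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker.example.com:9092"); // placeholder
        pprops.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pprops.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pprops.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "eos-app-1"); // unique per producer instance

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cprops);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pprops)) {
            consumer.subscribe(List.of("input-topic"));
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    producer.send(new ProducerRecord<>("output-topic", record.key(),
                            record.value().toUpperCase())); // example transform
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1));
                }
                // Step 2: commit the consumed offsets atomically with the
                // produced records, inside the same transaction.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
                // Production code should abort the transaction on KafkaException and
                // close the producer on fatal errors such as ProducerFencedException.
            }
        }
    }
}
```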

Summary

Here's a summary of key configurations for optimizing durability:

Producer Configurations

Configuration | Recommended Value | Default Value | Description
replication.factor | 3 | - | Number of replicas for each partition
acks | all | all (prior to Kafka 3.0: 1) | Number of acknowledgments required
enable.idempotence | true | true (prior to Kafka 3.0: false) | Enable exactly-once delivery semantics
max.in.flight.requests.per.connection | 1 | 5 | Maximum number of unacknowledged requests

Consumer Configurations

Configuration | Recommended Value | Default Value | Description
enable.auto.commit | false | true | Enable automatic offset commits
isolation.level | read_committed (Ursa Engine doesn't support read_committed yet) | read_uncommitted for the Java client; read_committed for librdkafka-based clients | Transaction isolation level for consumers