Graceful Cluster Rollout

Warning

This feature is currently in Private Preview and is not recommended for production environments. It may undergo significant changes as we refine and improve it based on feedback. Please use it with caution and report any issues or suggestions to help us enhance its functionality.

Note: Private Preview features may have limited support and could include known or unknown issues. Thoroughly test this feature's behavior in a non-production environment before integrating it into critical workflows.

Graceful Cluster Rollout is a StreamNative Operator feature that streamlines the upgrade process for Pulsar clusters. It minimizes the impact of Broker restarts on business operations, so a Pulsar cluster can be upgraded with minimal disruption.

How does it work?

Traditional Pulsar cluster upgrades relied on an in-place upgrade strategy that required sequentially restarting each Broker to deploy new versions. This approach was disruptive since it caused topic ownership to transfer repeatedly between brokers, resulting in frequent client reconnections that could destabilize the cluster.

Graceful Cluster Rollout introduces a new upgrade approach that uses two separate StatefulSets to smoothly transition between broker versions. The broker StatefulSet names now include an 8-character revision hash (for example, <cluster-name>-<revision-hash>) that represents the current version of the PulsarBroker CRD. When the PulsarBroker CRD spec is modified, a new revision hash is generated and triggers a controlled cluster upgrade through these steps:

  1. The StreamNative Operator creates a new broker StatefulSet with the updated revision hash, running the new configuration specified in the PulsarBroker CRD while maintaining the existing cluster's operation.
  2. Using Pulsar's Topic Unloading mechanism, the Operator gradually and safely migrates traffic from the old broker pods to the new ones.
  3. After confirming successful traffic migration, the Operator cleanly terminates the old broker pods.

This approach eliminates the need for disruptive in-place restarts and provides fine-grained control over the rollout speed, allowing organizations to minimize impact on their business operations.

The system maintains a history of changes by creating a PulsarBrokerRevision CRD for each modification to the PulsarBroker CRD spec. This enables administrators to monitor revision status and roll back to previous versions if needed.

The feature also includes support for canary deployments, allowing teams to validate upgrades on a subset of Brokers before proceeding with a full cluster rollout or executing a rollback if issues are detected.

Prerequisites

  • Cluster Version: 4.0.0 or higher.
  • StreamNative Operator Version: 0.7.7 or higher.

Limitations

Due to the architectural changes in broker management introduced by Graceful Cluster Rollout, this feature is currently only compatible with Pulsar Clusters that are created using StreamNative Operator version 0.7.7 or higher. Support for enabling this feature on Pulsar clusters created with earlier versions of StreamNative Operator is planned for future releases.

Create a Pulsar Cluster with Graceful Cluster Rollout

To create a new Pulsar cluster with Graceful Cluster Rollout, add the annotation cloud.streamnative.io/enable-revision: 'true' to the PulsarBroker CRD metadata. This enables the feature during cluster creation.

apiVersion: pulsar.streamnative.io/v1alpha1
kind: PulsarBroker
metadata:
  annotations:
    # Enable Graceful Cluster Rollout
    cloud.streamnative.io/enable-revision: 'true'
  name: <your-cluster-name>
  namespace: <your-namespace>
# ...

Customize Traffic Migration Speed

By default, the Operator migrates traffic during cluster rollout using the following parameters:

  • Traffic migration rate: 10% of topics per interval
  • Migration interval: 60 seconds between migrations
  • Total rollout time: Approximately 10 minutes per broker

You can customize these parameters using the following annotations:

  • cloud.streamnative.io/topic-unloading-speed: Controls the fraction of traffic to migrate in each interval (default: "0.1", meaning 10%).
  • cloud.streamnative.io/topic-unloading-interval-second: Sets the time in seconds between migrations (default: "60").

Here's an example configuration:

apiVersion: pulsar.streamnative.io/v1alpha1
kind: PulsarBroker
metadata:
  annotations:
    # Enable Graceful Cluster Rollout
    cloud.streamnative.io/enable-revision: 'true'
    # Configure migration speed: 10% every 60 seconds
    cloud.streamnative.io/topic-unloading-speed: '0.1'
    cloud.streamnative.io/topic-unloading-interval-second: '60'
  name: demo
  namespace: default
# ...
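
As a further illustration, the sketch below doubles the migration rate and halves the interval. Assuming the total time scales the same way as the documented defaults (10% every 60 seconds gives roughly 10 minutes per broker), migrating 20% every 30 seconds would take about 5 intervals, or roughly 2.5 minutes per broker. The values are illustrative, not recommendations; faster settings move more topics at once and may cause more client reconnections per interval.

```yaml
apiVersion: pulsar.streamnative.io/v1alpha1
kind: PulsarBroker
metadata:
  annotations:
    cloud.streamnative.io/enable-revision: 'true'
    # Illustrative values: 20% every 30 seconds
    # (about 1 / 0.2 = 5 intervals x 30 s = 150 s per broker, assuming linear scaling)
    cloud.streamnative.io/topic-unloading-speed: '0.2'
    cloud.streamnative.io/topic-unloading-interval-second: '30'
  name: demo
  namespace: default
# ...
```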

Canary Deployment and Rollback

You can perform a canary deployment by configuring the spec.updateStrategy.partition field in the PulsarBroker CRD. This field controls which brokers get updated during a rollout. The brokers with an ordinal number greater than or equal to the specified partition value will be updated. By default, the partition value is 0, meaning all brokers will be updated.

apiVersion: pulsar.streamnative.io/v1alpha1
kind: PulsarBroker
metadata:
  annotations:
    cloud.streamnative.io/enable-revision: 'true'
  name: demo
  namespace: default
spec:
  updateStrategy:
    # Brokers with ordinal >= partition will be updated
    # Default is 0, meaning all Brokers will be updated
    partition: 2
# ...
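
To make the rule concrete, consider a hypothetical cluster with four brokers (ordinals 0 through 3). Setting partition to 3 would update only the broker with ordinal 3 and leave ordinals 0 through 2 on the previous revision. The broker count and the choice of 3 are assumptions for illustration.

```yaml
apiVersion: pulsar.streamnative.io/v1alpha1
kind: PulsarBroker
metadata:
  annotations:
    cloud.streamnative.io/enable-revision: 'true'
  name: demo
  namespace: default
spec:
  updateStrategy:
    # Assumes 4 brokers (ordinals 0-3): only ordinal 3 is updated
    partition: 3
# ...
```

Once the canary broker looks healthy, lowering partition back to 0 should let the rollout continue to the remaining brokers, since 0 means all brokers are updated.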

If you need to roll back to the previous version during the canary deployment:

  1. Restore the previous configuration of the PulsarBroker CRD.
  2. Set the spec.updateStrategy.partition field to 0.

The Operator will then migrate traffic from the new brokers to the old ones, and finally terminate the new brokers.
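
A minimal sketch of the rolled-back manifest, assuming the canary example above was in progress. The comment marks where the previous spec would be restored; its contents depend on your cluster and are not shown.

```yaml
apiVersion: pulsar.streamnative.io/v1alpha1
kind: PulsarBroker
metadata:
  annotations:
    cloud.streamnative.io/enable-revision: 'true'
  name: demo
  namespace: default
spec:
  # Step 1: restore the previous spec fields here (cluster-specific, not shown)
  updateStrategy:
    # Step 2: reset partition to 0
    partition: 0
# ...
```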

History Revisions

The Operator maintains a history of revisions by creating a PulsarBrokerRevision CRD for each modification to the PulsarBroker CRD spec. You can check the revision history of a Pulsar cluster by running the following command:

kubectl get pulsarbrokerrevision -n <namespace>

Revision Retention

The Operator retains the revision history for a specified number of revisions. By default, up to three historical snapshots are retained. You can configure the retention limit by specifying the spec.updateStrategy.revisionHistoryLimit field in the PulsarBroker CRD.
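
For example, the following sketch raises the limit to five revisions. The value 5 is an arbitrary choice for illustration; the field path is the one described above.

```yaml
apiVersion: pulsar.streamnative.io/v1alpha1
kind: PulsarBroker
metadata:
  annotations:
    cloud.streamnative.io/enable-revision: 'true'
  name: demo
  namespace: default
spec:
  updateStrategy:
    # Keep up to 5 historical revisions (illustrative value; default is 3)
    revisionHistoryLimit: 5
# ...
```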

Troubleshooting

Use the following techniques to monitor and troubleshoot the cluster rollout process:

Monitor Rollout Progress

You can monitor the rollout progress in two ways:

  • Logs: Monitor the Broker and StreamNative Operator logs for messages containing keywords like Unloading to track the rollout progress.
  • Metrics: Check the Grafana dashboard to observe topic migration. The number of topics should increase on new Brokers and decrease on old Brokers.

Skip Traffic Migration

If you encounter issues during traffic migration, the upgrade process may get stuck. You can add the cloud.streamnative.io/skip-topic-unload: "true" annotation to skip traffic migration. Apply this annotation to:

  • Pod: Skip traffic migration for an individual Broker.
  • PulsarBrokerRevision: Skip traffic migration for the current rolling upgrade.
  • PulsarBroker: Skip traffic migration for the current rolling upgrade and for future upgrades.

Here's an example of the annotation:

metadata:
  annotations:
    cloud.streamnative.io/skip-topic-unload: 'true'
# ...