1. Operating StreamNative Platform

Configure storage

This document describes how to configure storage for StreamNative Platform.

Persistent storage volumes

You can use local PersistentVolumes (PVs) with storage classes, or the default Kubernetes StorageClass, to provision persistent storage for your data.

Local PVs and storage classes

Note

If you deploy a local Kubernetes cluster, you need to configure the local PVs and storage classes for persisting data to your local storage.

Pulsar cluster components such as BookKeeper and ZooKeeper require persistent storage of data. To persist data in Kubernetes, you need to use PVs. A PV contains the details of the storage that is available for use by the Pulsar cluster. A PV can be provisioned by an administrator statically, or dynamically using a StorageClass. A StorageClass provides a way for administrators to describe the "classes" of storage they offer. Different classes might map to Quality-of-Service (QoS) levels, to backup policies, or to arbitrary policies determined by the cluster administrators.

PVs are bound to Pods through PersistentVolumeClaims (PVCs). A PVC is a request for storage by a user, similar to how a Pod is a request for compute: Pods consume node resources and PVCs consume PV resources.
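
A PVC declares the size and storage class that a workload needs. As an illustrative sketch (the PVC name, storage class name, and size below are assumptions, not platform defaults):

```yaml
# Illustrative only: name, storage class, and size are hypothetical.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data-pvc        # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-ssd   # hypothetical storage class
  resources:
    requests:
      storage: 10Gi
```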

To configure the local PVs and storage classes, follow these steps.

  1. Preallocate local storage in each cluster node.

    The example below creates five Solid State Drive (SSD) volumes and five Hard Disk Drive (HDD) volumes.

    Note

    This code example is intended for test environments only. Configure your local storage to suit your production environment.

    #!/bin/bash
    # Pre-create five SSD-backed volumes and bind-mount them so the
    # local volume provisioner can discover each one as a separate disk.
    for i in $(seq 1 5); do
      mkdir -p /mnt/ssd-bind/vol${i}
      mkdir -p /mnt/ssd/vol${i}
      mount --bind /mnt/ssd-bind/vol${i} /mnt/ssd/vol${i}
    done
    # Repeat for five HDD-backed volumes.
    for i in $(seq 1 5); do
      mkdir -p /mnt/hdd-bind/vol${i}
      mkdir -p /mnt/hdd/vol${i}
      mount --bind /mnt/hdd-bind/vol${i} /mnt/hdd/vol${i}
    done
    
  2. Install the local volume provisioner.

    Note

    The local volume provisioner manages the PV lifecycle for pre-allocated disks by detecting and creating PVs for each local disk on the host, and then cleaning up the disks when released. It does not support dynamic provisioning.

    a. Define a YAML file to configure the local volume provisioner.

    Here is an example of the YAML file used for configuring the local volume provisioner.
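
    The exact manifest depends on the provisioner release you install. As a hedged sketch following the kubernetes-sigs local static provisioner conventions, a ConfigMap mapping the discovery directories created in step 1 to two storage classes (the class names local-ssd and local-hdd are assumptions, not platform defaults) could look like:

```yaml
# Illustrative sketch only: adjust the namespace, class names, and
# paths to match your environment and provisioner version.
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config
  namespace: kube-system
data:
  storageClassMap: |
    local-ssd:
      hostDir: /mnt/ssd
      mountDir: /mnt/ssd
    local-hdd:
      hostDir: /mnt/hdd
      mountDir: /mnt/hdd
```

    The provisioner scans each hostDir and creates one PV per mounted volume it finds there, which is why step 1 bind-mounts each directory.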

    b. Apply the YAML file to install the local volume provisioner.

    kubectl apply -f /path/to/local-volume-provisioner/file.yaml

  3. Verify that the local volume provisioner is created successfully.

    kubectl get po -n <k8s_namespace> | grep local-volume
    
  4. Verify that all PVs are created successfully.

    kubectl get pv
    
  5. Verify that all storage classes are created successfully.

    kubectl get storageclasses
    

Kubernetes default StorageClass

If you do not set volumes.data.storageClassName in the values.yaml file, the Pulsar operator uses the default storage class.

Use the command below to get the name of the current default storage class:

kubectl get sc

To use the Kubernetes default storage class, it is recommended that you set the following properties on the default StorageClass:

  • volumeBindingMode: WaitForFirstConsumer
  • reclaimPolicy: Retain
  • allowVolumeExpansion: true (required for production deployments)
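
Expressed as a manifest, the recommended properties look like the following sketch; the class name and provisioner value are placeholders, not values shipped with the platform:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-default-class              # placeholder name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/no-provisioner  # placeholder; use your provisioner
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain
allowVolumeExpansion: true
```

WaitForFirstConsumer delays volume binding until a Pod is scheduled, which keeps local volumes on the same node as the Pods that consume them.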

Multiple volumes

Pulsar uses Apache BookKeeper for persistent message storage. The BookKeeper server (bookie) uses ledgers and journals to manage data updates and transaction logs, and supports concurrent writes. To persist data and avoid data loss, you can configure multiple volumes, as well as multiple directories per volume, for journals and ledgers in the values.yaml file.

  • Volume configurations for journals

    bookkeeper:
      volumes:
        # use a persistent volume or emptyDir
        persistence: true
        journal:
          # It determines the directory of journal data
          numVolumes: 1 # --- [1]
          numDirsPerVolume: 1 # --- [2]
    
    • [1] numVolumes: the number of volumes supported for each journal.
    • [2] numDirsPerVolume: the number of directories per volume that BookKeeper writes its write-ahead log (WAL) to.
  • Volume configurations for ledgers

    bookkeeper:
      volumes:
        # use a persistent volume or emptyDir
        persistence: true
        ledgers:
          name: ledgers
          size: 50Gi
          # It determines the directory of ledgers data
          numVolumes: 1 # --- [1]
          numDirsPerVolume: 1 # --- [2]
    
    • [1] numVolumes: the number of volumes supported for each ledger.
    • [2] numDirsPerVolume: the number of directories per volume that BookKeeper writes ledger data to.
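
Combining both, a values.yaml fragment that spreads journal and ledger data across two volumes with two directories each might look like this sketch (the volume counts and ledger size are illustrative):

```yaml
bookkeeper:
  volumes:
    persistence: true
    journal:
      numVolumes: 2        # two journal volumes per bookie
      numDirsPerVolume: 2  # two WAL directories on each volume
    ledgers:
      name: ledgers
      size: 50Gi
      numVolumes: 2        # two ledger volumes per bookie
      numDirsPerVolume: 2  # two ledger directories on each volume
```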

Tiered Storage

Tiered Storage makes storing huge volumes of data in Pulsar manageable by reducing operational burden and cost. The fundamental idea is to separate data storage from data processing, allowing each to scale independently. With Tiered Storage, you can send data to cost-effective object storage, and scale brokers only when you need more compute resources.

StreamNative Platform supports the following object storage solutions for Tiered Storage:

  • AWS S3
  • Google Cloud Storage
  • Azure Blob Storage

Enable Tiered Storage

Starting from StreamNative Platform 1.3.0, you can enable Tiered Storage by setting broker.offload.enabled=true. When you enable Tiered Storage, you need to configure the type of blob storage to use and its related properties, such as the bucket or container, the region, and the credentials.

When a Pulsar cluster is deleted, StreamNative Platform does not garbage-collect the contents of the Tiered Storage bucket. You can either wait for the configured deletion interval or manually delete the objects in the Tiered Storage bucket.

To disable Tiered Storage, you can set broker.offload.enabled=false.

Configure Tiered Storage for AWS S3

Before enabling Tiered Storage on Amazon Web Services (AWS) with Amazon Simple Storage Service (S3) buckets, you need to configure the following:

  • Generate an AWS access key and secret access key.

  • Create an AWS S3 bucket.

  • Create a Kubernetes secret to save your AWS credentials with the command below. When you configure Tiered Storage, you can specify the Kubernetes secret. Pulsar brokers use the credentials stored in the Kubernetes secret to access the storage container. When your storage credentials change, you need to restart the Pulsar cluster.

    kubectl -n <k8s_namespace> create secret generic \
      --from-literal=AWS_ACCESS_KEY_ID=<aws_access_key> \
      --from-literal=AWS_SECRET_ACCESS_KEY=<aws_secret_key> \
      [secret name]
    

To enable Tiered Storage for AWS S3, set the following fields in the values.yaml file:

broker:
  offload:
    enabled: true
    managedLedgerMinLedgerRolloverTimeMinutes: ''
    managedLedgerMaxEntriesPerLedger: ''
    managedLedgerOffloadDriver: ''
    s3:
      enabled: true
      s3ManagedLedgerOffloadRegion: '[YOUR REGION OF S3]'
      s3ManagedLedgerOffloadBucket: '[YOUR BUCKET OF S3]'
      s3ManagedLedgerOffloadMaxBlockSizeInBytes: ''
      s3ManagedLedgerOffloadReadBufferSizeInBytes: ''
      s3ManagedLedgerOffloadServiceEndpoint: ''
      secret: '[the name of the created Kubernetes secret]'

The following list outlines the fields available for configuring Tiered Storage for AWS S3.

  • broker.offload.enabled: Enable Tiered Storage. Default: false. Required.
  • broker.offload.managedLedgerMinLedgerRolloverTimeMinutes: The minimum time between ledger rollovers for a topic. It is not recommended to set this field in production environments. Default: 10. Optional.
  • broker.offload.managedLedgerMaxEntriesPerLedger: The maximum number of entries to append to a ledger before triggering a rollover. It is not recommended to set this field in production environments. Default: 50000. Optional.
  • broker.offload.managedLedgerOffloadDriver: The offloader driver name, which is case-insensitive. There is a third driver type, S3, which is identical to aws-s3 except that it requires you to specify an endpoint URL using the s3ManagedLedgerOffloadServiceEndpoint field. This is useful if you use an S3-compatible data store other than AWS S3. Default: aws-s3. Required.
  • broker.offload.s3.enabled: Enable Tiered Storage for AWS S3. Default: false. Required.
  • broker.offload.s3.s3ManagedLedgerOffloadRegion: The AWS S3 bucket region. Before specifying a value for this field, set s3ManagedLedgerOffloadServiceEndpoint (for example, s3ManagedLedgerOffloadServiceEndpoint=https://s3.YOUR_REGION.amazonaws.com) and grant the GetBucketLocation permission to the user; otherwise, you might get an error. For details about how to grant the GetBucketLocation permission to a user, see bucket operations. Default: N/A. Optional.
  • broker.offload.s3.s3ManagedLedgerOffloadBucket: The AWS S3 bucket. Default: N/A. Required.
  • broker.offload.s3.s3ManagedLedgerOffloadMaxBlockSizeInBytes: The maximum size of a block sent during a multi-block upload to AWS S3. It cannot be smaller than 5 MB. Default: 64 MB. Required.
  • broker.offload.s3.s3ManagedLedgerOffloadReadBufferSizeInBytes: The block size for each individual read when reading data from AWS S3. Default: 1 MB. Required.
  • broker.offload.s3.s3ManagedLedgerOffloadServiceEndpoint: An alternative AWS S3 endpoint to connect to (for testing purposes). Default: N/A. Required.
  • broker.offload.s3.secret: The Kubernetes secret that stores the AWS credentials. Default: N/A. Required.
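
As a hedged fragment illustrating the pairing of s3ManagedLedgerOffloadRegion with a matching service endpoint described above (the region value is an example only):

```yaml
broker:
  offload:
    s3:
      enabled: true
      s3ManagedLedgerOffloadRegion: 'us-west-2'  # example region
      s3ManagedLedgerOffloadServiceEndpoint: 'https://s3.us-west-2.amazonaws.com'
```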

Configure Tiered Storage for Google Cloud Storage

Before enabling Tiered Storage with Google Cloud Storage (GCS), you need to configure the following:

  • Create a GCS service account.

  • Create a GCS bucket.

  • Create a Kubernetes secret to save your Google credentials with the following command. When you configure Tiered Storage, you can specify the Kubernetes secret. Pulsar brokers use the credentials stored in the Kubernetes secret to access the storage container. When your storage credentials change, you need to restart the Pulsar cluster.

    kubectl -n <k8s_namespace> create secret generic \
      --from-file=<gcs_service_account_path> \
      [secret name]
    

To enable Tiered Storage for Google Cloud Storage, set the following fields in the values.yaml file:

broker:
  offload:
    enabled: true
    managedLedgerMinLedgerRolloverTimeMinutes: ''
    managedLedgerMaxEntriesPerLedger: ''
    managedLedgerOffloadDriver: ''
    gcs:
      enabled: true
      gcsManagedLedgerOffloadRegion: '[YOUR REGION OF GCS]'
      gcsManagedLedgerOffloadBucket: '[YOUR BUCKET OF GCS]'
      gcsManagedLedgerOffloadMaxBlockSizeInBytes: ''
      gcsManagedLedgerOffloadReadBufferSizeInBytes: ''
      secret: '[the name of the created Kubernetes secret]'

The following list outlines the fields available for configuring Tiered Storage for Google Cloud Storage.

  • broker.offload.enabled: Enable Tiered Storage. Default: false. Required.
  • broker.offload.managedLedgerMinLedgerRolloverTimeMinutes: The minimum time between ledger rollovers for a topic. It is not recommended to set this field in production environments. Default: 10. Optional.
  • broker.offload.managedLedgerMaxEntriesPerLedger: The maximum number of entries to append to a ledger before triggering a rollover. It is not recommended to set this field in production environments. Default: 50000. Optional.
  • broker.offload.managedLedgerOffloadDriver: The offloader driver name, which is case-insensitive. Default: google-cloud-storage. Required.
  • broker.offload.gcs.enabled: Enable Tiered Storage for Google Cloud Storage. Default: false. Required.
  • broker.offload.gcs.gcsManagedLedgerOffloadRegion: The Google Cloud Storage bucket region. Default: N/A. Required.
  • broker.offload.gcs.gcsManagedLedgerOffloadBucket: The Google Cloud Storage bucket. Default: N/A. Required.
  • broker.offload.gcs.gcsManagedLedgerOffloadMaxBlockSizeInBytes: The maximum size of a block sent during a multi-block upload to Google Cloud Storage. It cannot be smaller than 5 MB. Default: 64 MB. Optional.
  • broker.offload.gcs.gcsManagedLedgerOffloadReadBufferSizeInBytes: The block size for each individual read when reading data from Google Cloud Storage. Default: 1 MB. Optional.
  • broker.offload.gcs.secret: The Kubernetes secret that stores the Google credentials. Default: N/A. Required.

Configure Tiered Storage for Azure Blob Storage

Before enabling Tiered Storage with Azure Blob Storage, you need to configure the following:

  • Create an Azure storage account and a storage account access key.

  • Create an Azure Blob container.

  • Create a Kubernetes secret to save your Azure credentials with the command below. When you configure Tiered Storage, you can specify the Kubernetes secret. Pulsar brokers use the credentials stored in the Kubernetes secret to access the storage container. When your storage credentials change, you need to restart the Pulsar cluster.

    kubectl -n <k8s_namespace> create secret generic \
      --from-literal=AZURE_STORAGE_ACCOUNT=<azure_storage_account> \
      --from-literal=AZURE_STORAGE_ACCESS_KEY=<azure_storage_access_key> \
      [secret name]
    

To enable Tiered Storage for Azure Blob Storage, set the following fields in the values.yaml file:

broker:
  offload:
    enabled: true
    managedLedgerMinLedgerRolloverTimeMinutes: ''
    managedLedgerMaxEntriesPerLedger: ''
    managedLedgerOffloadDriver: ''
    azureblob:
      enabled: true
      managedLedgerOffloadBucket: '[YOUR BLOB CONTAINER]'
      managedLedgerOffloadMaxBlockSizeInBytes: ''
      managedLedgerOffloadReadBufferSizeInBytes: ''
      managedLedgerOffloadServiceEndpoint: ''
      secret: '[the name of the created Kubernetes secret]'

The following list outlines the fields available for configuring Tiered Storage for Azure Blob Storage.

  • broker.offload.enabled: Enable Tiered Storage. Default: false. Required.
  • broker.offload.managedLedgerMinLedgerRolloverTimeMinutes: The minimum time between ledger rollovers for a topic. It is not recommended to set this field in production environments. Default: 10. Optional.
  • broker.offload.managedLedgerMaxEntriesPerLedger: The maximum number of entries to append to a ledger before triggering a rollover. It is not recommended to set this field in production environments. Default: 50000. Optional.
  • broker.offload.managedLedgerOffloadDriver: The offloader driver name, which is case-insensitive. Default: azureblob. Required.
  • broker.offload.azureblob.enabled: Enable Tiered Storage for Azure Blob Storage. Default: false. Required.
  • broker.offload.azureblob.managedLedgerOffloadBucket: The Azure Blob container. Default: N/A. Required.
  • broker.offload.azureblob.managedLedgerOffloadMaxBlockSizeInBytes: The maximum size of a block sent during a multi-block upload to Azure Blob Storage. It cannot be smaller than 5 MB. Default: 64 MB. Optional.
  • broker.offload.azureblob.managedLedgerOffloadReadBufferSizeInBytes: The block size for each individual read when reading data from Azure Blob Storage. Default: 1 MB. Optional.
  • broker.offload.azureblob.managedLedgerOffloadServiceEndpoint: An alternative Azure Blob Storage endpoint to connect to (for testing purposes). Default: N/A. Required.
  • broker.offload.azureblob.secret: The Kubernetes secret that stores the Azure credentials. Default: N/A. Required.