
Get started with Lakehouse tiered storage

Note

This feature is currently in private preview. If you want to try it out or have any questions, submit a ticket to the support team.

Tutorial Overview

This tutorial walks you through enabling Lakehouse tiered storage on StreamNative Private Cloud. It covers the essential steps: offloading data to Lakehouse products, streaming reads from Lakehouse, and batch reads from Lakehouse.

Step 1: Enable Lakehouse Tiered Storage on Private Cloud

Prerequisites

Before activating the Lakehouse tiered storage feature, ensure you have prepared an S3 or GCS bucket and granted the necessary permissions for your AWS or GCP account. Follow the AWS S3 or GCP GCS documentation to create the bucket and grant the appropriate permissions.
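
For example, you can create the bucket with the AWS CLI or the Google Cloud SDK; the bucket name and region below are illustrative:

aws s3 mb s3://lakehouse-offload-bucket --region us-east-1
gsutil mb gs://lakehouse-offload-bucket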

Activation Process

To enable Lakehouse tiered storage, configure the conf/broker.conf and conf/offload.conf files within the Pulsar broker environment.

In conf/broker.conf, include the following configurations:

| Configuration | Description | Required | Default Value |
| --- | --- | --- | --- |
| managedLedgerOffloadDriver | Case-insensitive offloader driver name (e.g., delta or iceberg) | Yes | N/A |
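
For example, to use the Delta Lake offloader, conf/broker.conf would contain:

managedLedgerOffloadDriver=delta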

In conf/offload.conf, add the following configurations:

| Configuration | Description | Required | Default Value |
| --- | --- | --- | --- |
| metadataServiceUri | Metadata service URI for the BookKeeper client (e.g., zk://localhost:2181/ledgers) | Yes | N/A |
| pulsarWebServiceUrl | Pulsar web service URL (e.g., http://localhost:8080) | Yes | N/A |
| pulsarServiceUrl | Pulsar protocol service URL (e.g., pulsar://localhost:6650) | Yes | N/A |
| offloadProvider | Offloader driver name (e.g., delta or iceberg) | Yes | N/A |
| storagePath | Storage path (e.g., s3a://bucket-name/prefix or gs://bucket-name/prefix) | No | data |
| googleCloudProjectID | GCS offload project configuration (e.g., example-project) | Required if offloading data to GCS | N/A |
| googleCloudServiceAccountFile | GCS offload authentication (e.g., /Users/user-name/Downloads/project-804d5e6a6f33.json) | Required if offloading data to GCS | N/A |

If you want to offload data to S3, export AWS_ACCESS_KEY_ID and AWS_SECRET_KEY before starting the broker service, and set storagePath to a path with the s3a:// prefix.
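
For example, with placeholder values:

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_KEY=<your-secret-key>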

Lakehouse products such as Delta Lake are supported; data is written in Parquet format with Snappy compression. Specify the offload provider (delta or iceberg) in conf/offload.conf to choose the Lakehouse product.
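
Putting these together, a minimal conf/offload.conf sketch for Delta Lake, assuming a local ZooKeeper and broker and the illustrative S3 bucket path from the table above, could look like this:

metadataServiceUri=zk://localhost:2181/ledgers
pulsarWebServiceUrl=http://localhost:8080
pulsarServiceUrl=pulsar://localhost:6650
offloadProvider=delta
storagePath=s3a://bucket-name/prefix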

Upon completing these configurations, start the Pulsar broker to initiate the Lakehouse tiered storage offload service.

After completing the above steps, the Lakehouse tiered storage feature is supported on your Pulsar cluster. However, it is still not enabled by default; you need to enable it at the namespace level by setting the namespace offload threshold. For example, set the offload threshold to 0 with pulsar-admin, which means all data will be offloaded to the Lakehouse immediately:

bin/pulsar-admin namespaces set-offload-threshold public/default --size 0

If you want to disable offloading, set the offload threshold to -1:

bin/pulsar-admin namespaces set-offload-threshold public/default --size -1 --time -1

Step 2: Offload Data to Lakehouse

Once Lakehouse tiered storage is enabled for your namespace or cluster, produce messages to your topic and they will be offloaded to the Lakehouse table automatically. Note that support is currently limited to the AVRO schema and Pulsar primitive schemas; other schemas are under development.

For testing, use pulsar-perf to produce messages to a topic:

bin/pulsar-perf produce -r 1000 -u pulsar://localhost:6650 persistent://public/default/test-topic

Check if the topic data has been offloaded to Lakehouse by examining the topic's internal stats using pulsar-admin:

bin/pulsar-admin topics stats-internal persistent://public/default/test-topic

In the topic's internal stats, you can find the __OFFLOAD cursor if the offload process has started.

To check whether a ledger has been offloaded, look at its offloaded flag: once the ledger has been offloaded to the Lakehouse, the flag is set to true.

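An abridged, illustrative excerpt of the stats-internal output (ledger IDs, entry counts, and sizes will differ on your cluster):

{
  "ledgers" : [ {
    "ledgerId" : 123,
    "entries" : 1000,
    "size" : 65536,
    "offloaded" : true
  } ],
  "cursors" : {
    "__OFFLOAD" : { ... }
  }
}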

Note: The topic offload process is triggered by ledger rollover. Once triggered, it offloads subsequent ledgers in streaming mode and does not need to wait for further ledger rollovers. Therefore, if you produce messages to the topic but the first ledger has not rolled over yet, the offload process will not start.
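
If you only want to verify the feature quickly, you can make ledgers roll over sooner by lowering the standard rollover settings in conf/broker.conf; the values below are illustrative, not production recommendations:

managedLedgerMaxEntriesPerLedger=5000
managedLedgerMinLedgerRolloverTimeMinutes=0
managedLedgerMaxLedgerRolloverTimeMinutes=1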

Step 3: Read Data from Lakehouse

After offloading data to Lakehouse, you can read it using the Pulsar reader/consumer API or the Lakehouse product API.

Streaming Read from Lakehouse

Use the Pulsar reader/consumer API to read data from the Lakehouse, just as you would read from a Pulsar topic. For testing, consume messages from a topic with pulsar-perf:

bin/pulsar-perf consume -ss test_sub -sp Earliest -u pulsar://localhost:6650 persistent://public/default/test-topic

Batch Read from Lakehouse

You can use Spark SQL, Athena, Trino, or other tools to read the Delta table data from the S3 or GCS bucket. Because the bucket is owned by you and the offloaded data is stored directly in it, any engine that can read Delta tables from object storage will work.
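
As a sketch, assuming Spark with the Delta Lake connector and the illustrative storagePath from the configuration above (the exact table directory layout under the storage path depends on your topic), a batch read with Spark SQL could look like this:

spark-sql \
  --packages io.delta:delta-core_2.12:2.4.0,org.apache.hadoop:hadoop-aws:3.3.4 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  -e 'SELECT * FROM delta.`s3a://bucket-name/prefix` LIMIT 10'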

Disable Lakehouse Tiered Storage

If you want to disable the Lakehouse tiered storage feature, set the offload threshold to -1 using pulsar-admin:

bin/pulsar-admin namespaces set-offload-threshold public/default --size -1 --time -1

Upon disabling, the topic data will no longer be offloaded to Lakehouse. Ensure you retain offload-related configurations and the Lakehouse table to prevent data loss.

Demo

Watch a demo video showcasing data offloading to Lakehouse and data retrieval from Lakehouse.
