Get started with Lakehouse tiered storage
Note
This feature is currently in private preview. If you want to try it out or have any questions, submit a ticket to the support team.
Tutorial Overview
This tutorial provides a comprehensive walkthrough on enabling Lakehouse tiered storage on StreamNative Private Cloud. It covers essential steps such as offloading data to Lakehouse products, streaming reads from Lakehouse, and batch reads from Lakehouse, offering a holistic understanding of the integration.
Step 1: Enable Lakehouse Tiered Storage on Private Cloud
Prerequisites
Before activating the Lakehouse tiered storage feature, ensure you have prepared an S3 or GCS bucket with the necessary permissions for your AWS or GCP account. Follow the AWS S3 or GCP GCS documentation to create the bucket and grant the appropriate permissions.
Activation Process
To enable Lakehouse tiered storage, configure the conf/broker.conf and conf/offload.conf files within the Pulsar broker environment.
In conf/broker.conf, include the following configuration:
Configuration | Description | Required | Default Value |
---|---|---|---|
managedLedgerOffloadDriver | Case-insensitive offloader driver name (e.g., delta or iceberg) | Yes | N/A |
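For example, the broker-side addition can be a single line, using the delta driver name from the table above:

```properties
# conf/broker.conf — select the Lakehouse offloader driver
managedLedgerOffloadDriver=delta
```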
In conf/offload.conf, add the following configurations:
Configuration | Description | Required | Default Value |
---|---|---|---|
metadataServiceUri | Metadata service URI for the BookKeeper client (e.g., zk://localhost:2181/ledgers) | Yes | N/A |
pulsarWebServiceUrl | Pulsar web service URL (e.g., http://localhost:8080) | Yes | N/A |
pulsarServiceUrl | Pulsar protocol service URL (e.g., pulsar://localhost:6650) | Yes | N/A |
offloadProvider | Offloader driver's name (e.g., delta or iceberg) | Yes | N/A |
storagePath | Storage path (e.g., s3a://bucket-name/prefix or gs://bucket-name/prefix) | No | data |
googleCloudProjectID | GCS offload project configuration. For example: example-project | Required if offloading data to GCS | N/A |
googleCloudServiceAccountFile | GCS offload authentication. For example: /Users/user-name/Downloads/project-804d5e6a6f33.json | Required if offloading data to GCS | N/A |
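Putting the required settings together, a minimal conf/offload.conf for offloading to S3 might look like the following sketch; the bucket name and prefix are placeholders, and the URIs reuse the example values from the table above:

```properties
# conf/offload.conf — minimal sketch for offloading to a Delta table on S3
metadataServiceUri=zk://localhost:2181/ledgers
pulsarWebServiceUrl=http://localhost:8080
pulsarServiceUrl=pulsar://localhost:6650
offloadProvider=delta
storagePath=s3a://bucket-name/prefix
```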
If you want to offload data to S3, export AWS_ACCESS_KEY_ID and AWS_SECRET_KEY before starting up the broker service, and set storagePath with the s3a:// prefix.
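For example, the credentials can be exported in the shell that starts the broker. The values below are AWS's documented example placeholders, not real keys; substitute your own:

```shell
# Placeholder AWS credentials for the S3 offloader; replace with real values.
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
```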
Lakehouse products such as Delta Lake are supported, with data typically written in Parquet format with Snappy compression. Specify the offload provider (delta or iceberg) in conf/offload.conf to choose the Lakehouse product.
Upon completing these configurations, start the Pulsar broker to initiate the Lakehouse tiered storage offload service.
After completing the above steps, the Lakehouse tiered storage feature is available on your Pulsar cluster. However, it is still not enabled by default; you need to enable it at the namespace level by setting the namespace offload threshold. For example, set the offload threshold to 0 with pulsar-admin, which means all data is offloaded to Lakehouse immediately:
bin/pulsar-admin namespaces set-offload-threshold public/default --size 0
If you want to disable offloading, set the offload threshold to -1:
bin/pulsar-admin namespaces set-offload-threshold public/default --size -1 --time -1
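To confirm the current setting, you can read the threshold back for the namespace (this assumes a running cluster reachable by pulsar-admin):

```shell
bin/pulsar-admin namespaces get-offload-threshold public/default
```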
Step 2: Offload Data to Lakehouse
Once Lakehouse tiered storage is enabled for your namespace or cluster, produce messages to your topic and they are automatically offloaded to the Lakehouse table. Note that support is currently limited to AVRO schemas and Pulsar primitive schemas; other schema types are under development.
For testing, use pulsar-perf to produce messages to a topic:
bin/pulsar-perf produce -r 1000 -u pulsar://localhost:6650 persistent://public/default/test-topic
Check whether the topic data has been offloaded to Lakehouse by examining the topic's internal stats using pulsar-admin:
bin/pulsar-admin topics stats-internal persistent://public/default/test-topic
In the internal stats, you can find the __OFFLOAD cursor if the offload process has started. To check whether a ledger has been offloaded, look at its offloaded flag; if the ledger has been offloaded to Lakehouse, the flag is set to true.
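These checks can be scripted against a saved copy of the stats output. The JSON below is an illustrative, made-up fragment showing only the fields discussed here; real output comes from the pulsar-admin command and contains many more fields:

```shell
# Illustrative stats-internal fragment; in practice, produce it with:
#   bin/pulsar-admin topics stats-internal persistent://public/default/test-topic > stats.json
cat > stats.json <<'EOF'
{
  "ledgers" : [
    { "ledgerId" : 10, "entries" : 1000, "offloaded" : true },
    { "ledgerId" : 11, "entries" : 0, "offloaded" : false }
  ],
  "cursors" : {
    "__OFFLOAD" : { "markDeletePosition" : "10:999" }
  }
}
EOF
# The __OFFLOAD cursor is present once offloading has started:
grep -c '__OFFLOAD' stats.json
# Count ledgers whose offloaded flag is true:
grep -c '"offloaded" : true' stats.json
```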
Note: The topic offload process is triggered by ledger rollover. Once triggered, it offloads subsequent ledgers in streaming mode without waiting for further rollovers. This means that if you produce messages to a topic and the first ledger has not yet rolled over, the offload process will not start.
Step 3: Read Data from Lakehouse
After offloading data to Lakehouse, you can read it using the Pulsar reader/consumer API or the Lakehouse product API.
Streaming Read from Lakehouse
Use the Pulsar reader/consumer API to access data from Lakehouse, just as you would read from a Pulsar topic. For testing, consume messages from a topic using pulsar-perf:
bin/pulsar-perf consume -ss test_sub -sp Earliest -u pulsar://localhost:6650 persistent://public/default/test-topic
Batch Read from Lakehouse
You can use Spark SQL, Athena, Trino, or other tools to read the Delta table data directly from the S3 or GCS bucket, since both the bucket and the data in it are owned by you.
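As an illustration, a Delta table offloaded to S3 could be queried with the Spark SQL CLI. The table path below is a hypothetical assumption about the layout the offloader creates under storagePath, and the Delta package version is only an example; adjust both for your environment:

```shell
# Hypothetical batch read of the offloaded Delta table (requires Spark with the Delta Lake package)
spark-sql \
  --packages io.delta:delta-core_2.12:2.4.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  -e "SELECT * FROM delta.\`s3a://bucket-name/prefix/public/default/test-topic\` LIMIT 10"
```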
Disable Lakehouse Tiered Storage
If you want to disable the Lakehouse tiered storage feature, set the offload threshold to -1 using pulsar-admin:
bin/pulsar-admin namespaces set-offload-threshold public/default --size -1 --time -1
Upon disabling, the topic data will no longer be offloaded to Lakehouse. Ensure you retain offload-related configurations and the Lakehouse table to prevent data loss.
Demo
Watch a demo video showcasing data offloading to Lakehouse and data retrieval from Lakehouse.