
Get started with Lakehouse storage

Note

This feature is currently in private preview. If you want to try it out or have any questions, submit a ticket to the support team.

Introduction

This guide provides a comprehensive walkthrough on enabling Lakehouse Storage on StreamNative Cloud. You will learn how to offload data to Lakehouse products, perform streaming reads, and execute batch reads seamlessly.

Step 1: Enable Lakehouse Storage on StreamNative Cloud

Lakehouse Storage is a new feature that requires manual activation and is available only in the Rapid release channel. Follow these steps to enable it:

  1. Contact the StreamNative support team to enable the Lakehouse Storage feature for your account.
  2. When creating a new Pulsar cluster, make sure to select the Rapid release channel.
  3. For existing Pulsar clusters, contact the StreamNative support team to switch to the Rapid release channel.


Among open table formats, Lakehouse Storage currently supports Delta Lake. By default, data is offloaded to a Delta Lake table. Data can be stored on AWS S3 or GCP GCS, depending on the cloud provider where the cluster is running.


Upon completing the steps above, Lakehouse Storage will be available on your Pulsar cluster. However, it requires manual activation at the namespace level by setting the offload threshold.

For example, to offload data as soon as possible, set the threshold to 0:

bin/pulsar-admin namespaces set-offload-threshold public/default --size 0
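
To confirm the configuration, you can read the threshold back through the Pulsar admin REST API. The snippet below is a minimal sketch in Python; the admin URL and the absence of authentication are assumptions, so substitute your cluster's admin endpoint and credentials.

import requests

# Hypothetical admin endpoint; on StreamNative Cloud, use your cluster's admin service URL and an auth token.
ADMIN_URL = "http://localhost:8080"
resp = requests.get(f"{ADMIN_URL}/admin/v2/namespaces/public/default/offloadThreshold")
resp.raise_for_status()
# 0 means offload as soon as possible; -1 means offloading is disabled.
print("Current offload threshold (bytes):", resp.text)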

To disable Lakehouse Storage, set the threshold to -1:

bin/pulsar-admin namespaces set-offload-threshold public/default --size -1 --time -1

Step 2: Make Data Available in Lakehouse

Once Lakehouse Storage is enabled for your namespace or cluster, data is automatically written in Lakehouse table formats as you produce messages to your topics. Currently, only topics with an AVRO schema or a primitive schema are supported.
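
For example, a producer that uses an AVRO schema might look like the following minimal sketch, written with the Pulsar Python client. The record type, field names, and topic are hypothetical placeholders for illustration only.

import pulsar
from pulsar.schema import AvroSchema, Integer, Record, String

# Hypothetical record type; any AVRO-compatible schema works the same way.
class SensorReading(Record):
    device_id = String()
    temperature = Integer()

# Replace the service URL and topic with your own cluster and topic.
client = pulsar.Client("pulsar://localhost:6650")
producer = client.create_producer(
    "persistent://public/default/test-topic",
    schema=AvroSchema(SensorReading),
)

# Messages produced with this schema are eligible for offloading to a Delta Lake table.
producer.send(SensorReading(device_id="sensor-1", temperature=23))
client.close()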

For testing purposes, you can use pulsar-perf to produce messages to a topic:

bin/pulsar-perf produce -r 1000 -u pulsar://localhost:6650 persistent://public/default/test-topic

Check whether data has been offloaded by examining the topic's internal stats with pulsar-admin or the StreamNative Cloud console, and look for the offload cursor in the output:

bin/pulsar-admin topics stats-internal persistent://public/default/test-topic


Step 3: Read Data from Lakehouse

Even though data is written in Lakehouse table formats, you can continue to read the topics using the Kafka or Pulsar APIs. Alternatively, if you want to replay or analyze all the data, you can read the Lakehouse tables directly with any analytical tool that is compatible with Lakehouse table formats, from Snowflake and Databricks to Athena, Flink, and Spark.

Streaming Read

You can continue to use the Pulsar or Kafka API to read data from the topic even when the data is written in Lakehouse table formats.

For example, you can create a reader or consumer with the Pulsar client and read the data as usual.
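
The following is a minimal sketch of such a reader, written with the Pulsar Python client; the service URL and topic name are placeholders for your own cluster and topic.

import pulsar

client = pulsar.Client("pulsar://localhost:6650")
# Start from the earliest available message to replay the whole topic.
reader = client.create_reader(
    "persistent://public/default/test-topic",
    start_message_id=pulsar.MessageId.earliest,
)

while reader.has_message_available():
    msg = reader.read_next()
    print("Received:", msg.data())

client.close()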

For testing, consume messages from a topic using pulsar-perf:

bin/pulsar-perf consume -ss test_sub -sp Earliest -u pulsar://localhost:6650 persistent://public/default/test-topic

Read as a Lakehouse table

Hosted Cluster

Currently, the topics are not exposed as Lakehouse tables for user access. This feature is coming soon.

BYOC Cluster

In BYOC (Bring Your Own Cloud) clusters, data is stored in the customer's own S3 or GCS buckets. Once Lakehouse Storage is enabled, the data is written to those buckets as Lakehouse tables, which you can read with Spark SQL, Athena, Trino, or other compatible tools.
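
As an example, the snippet below is a minimal sketch that reads the offloaded Delta Lake table with Spark. The bucket path is a placeholder, and it assumes the delta-spark package is available in your Spark environment.

from pyspark.sql import SparkSession

# Enable Delta Lake support in the Spark session.
spark = (
    SparkSession.builder.appName("read-lakehouse-table")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Replace with the actual bucket and table path used by your cluster.
df = spark.read.format("delta").load("s3a://<your-bucket>/<path-to-table>")
df.show(10)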

Catalog support is coming soon.

Disable Lakehouse Storage

If you want to disable the Lakehouse Storage feature, set the offload threshold to -1 using pulsar-admin:

bin/pulsar-admin namespaces set-offload-threshold public/default --size -1 --time -1

After disabling Lakehouse Storage, data will no longer be written as Lakehouse tables. To prevent data loss, do not delete the offload configurations or the existing Lakehouse tables.

Limitations

The limitations of Lakehouse Storage include:

  • Schema evolution does not support field deletion
  • Supported schema types vary by table format
    • Only AVRO and Pulsar primitive schemas are supported for Delta Lake tables
  • Different table formats cannot be configured for different namespaces or topics

Delta Lake-specific limitations:

  • Avoid using Spark to compact small files, as this can interfere with streaming reads from Delta tables. Support for compaction is available separately.

Notice

Lakehouse Storage uses the RawReader interface to retrieve messages from Pulsar topics and write them to the Lakehouse. A critical component of this process is the offload cursor, which marks the progress of the offloading operation. Prematurely advancing the offload cursor can lead to data loss when writing to Lakehouse tables.

When configuring time-to-live (TTL) settings at the namespace or topic level, consider the following two options:

  1. Do not configure TTL at the namespace or topic level:
  • This option imposes no TTL constraints at the namespace or topic level, so data persists without automatic time-based expiration.
  2. Configure TTL at the namespace or topic level:
  • If you do set a TTL, make sure the TTL value is greater than the retention policy configured for the topic or namespace; otherwise, messages may expire and advance the offload cursor before they are written to the Lakehouse table.

By aligning TTL configurations with retention policies, you can manage the data lifecycle in Lakehouse Storage and guard against unintended data loss.
