This guide provides a comprehensive walkthrough on enabling Lakehouse Storage on StreamNative Cloud. You will learn how to offload data to Lakehouse products, perform streaming reads, and execute batch reads seamlessly.
Step 1: Enable Lakehouse Storage on StreamNative Cloud
Lakehouse Storage is a new feature that requires manual activation and is currently available only in the Rapid release channel. Follow these steps to enable it:
Contact the StreamNative support team to enable the Lakehouse Storage feature for your account.
When creating a new Pulsar cluster, be sure to select the Rapid release channel.
For existing Pulsar clusters, contact the StreamNative support team to switch to the Rapid release channel.
Lakehouse Storage currently supports Delta Lake as its open table format; by default, data is offloaded to a Delta Lake table. Data can be stored on AWS S3 or GCP GCS, depending on the cloud provider where the cluster is running.

Upon completing the steps above, Lakehouse Storage is available on your Pulsar cluster. However, it must be activated manually at the namespace level by setting the offload threshold.
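For example, a minimal sketch using pulsar-admin, assuming a namespace named public/default; the threshold value of 0 is illustrative and triggers offloading as soon as a ledger is rolled over:

# Activate offloading for the namespace by setting the offload threshold (0 = offload eagerly)
bin/pulsar-admin namespaces set-offload-threshold --size 0 public/default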
Once Lakehouse Storage is enabled for your namespace or cluster, data is automatically written in lakehouse table formats when you produce messages to your topics. Currently, the feature supports only topics with an AVRO schema or a primitive schema.

For testing purposes, use pulsar-perf to produce messages to a topic:
bin/pulsar-perf produce -r 1000 -u pulsar://localhost:6650 persistent://public/default/test-topic
Check whether data has been offloaded by examining the topic’s internal stats using pulsar-admin or the StreamNative Cloud console:
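For instance, a quick check with pulsar-admin might look like the following, reusing the test topic from above; offloaded ledgers are typically flagged in the output:

# Inspect the topic's internal stats and look for ledgers marked as offloaded
bin/pulsar-admin topics stats-internal persistent://public/default/test-topic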
Even though the data is written in Lakehouse table formats, you can continue to read the topics using the Kafka or Pulsar API. Alternatively, if you want to replay all the data, you can read it directly from the Lakehouse tables using any analytical tool that is compatible with Lakehouse table formats, from Snowflake and Databricks to Athena, Flink, and Spark.
You can continue to use the Pulsar or Kafka API to read data from the topic even when the data is written in Lakehouse table formats. For example, you can create a reader or consumer using the Pulsar client to read data seamlessly. For testing, consume messages from a topic using pulsar-perf:
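A minimal sketch, reusing the test topic from above with pulsar-perf's default subscription settings:

# Consume messages from the topic to verify it is still readable through the Pulsar API
bin/pulsar-perf consume -u pulsar://localhost:6650 persistent://public/default/test-topic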
Currently, the topics are not exposed as Lakehouse tables for user access. This feature is coming soon.
BYOC Cluster
In BYOC (Bring Your Own Cloud) clusters, the data is stored in the customer’s S3 or GCS buckets. Once Lakehouse Storage is enabled, the data is stored as Lakehouse tables in those buckets, and you can use Spark SQL, Athena, Trino, or other tools to read the Lakehouse tables from these buckets (see the sketch below). Catalog support is coming soon.
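As a sketch, a Delta table stored in an S3 bucket could be queried with the Spark SQL CLI roughly as follows; the bucket, table path, and package versions are placeholders you would adapt to your environment:

# Query a Delta Lake table stored in your own bucket (paths and versions are illustrative)
spark-sql \
  --packages io.delta:delta-core_2.12:2.4.0,org.apache.hadoop:hadoop-aws:3.3.4 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  -e "SELECT * FROM delta.\`s3a://<your-bucket>/<path-to-table>\` LIMIT 10"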
After disabling Lakehouse Storage, data is no longer written as Lakehouse tables. To prevent data loss, do not delete the offload configuration or the existing Lakehouse tables.
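As a sketch, if disabling follows the same namespace-level threshold mechanism used for activation, it might look like this (in Pulsar, a threshold of -1 disables automatic offloading):

# Stop automatic offloading for the namespace; existing tables and configuration are left untouched
bin/pulsar-admin namespaces set-offload-threshold --size -1 public/default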
Supported schema types vary for different table formats:
Only AVRO and Pulsar primitive schemas are supported for Delta Lake tables.
Different table formats cannot be configured for different namespaces or topics.
Delta Lake-specific limitations:
Avoid using Spark for small-file compaction, as it can interfere with streaming reads from Delta tables; the compaction feature is supported separately.
In Lakehouse Storage, the RawReader interface is used to retrieve messages from Pulsar topics and write them to the Lakehouse. A critical component of this process is the offload cursor, which marks the progress of the offloading operation. Note that prematurely advancing the offload cursor can cause data loss when writing to Lakehouse tables.

When configuring Time-to-Live (TTL) settings at the namespace or topic level, consider the following two options:
Do Not Configure TTL at Namespace or Topic Level:
This option ensures that no TTL constraints are imposed at the namespace or topic level, allowing data to persist without automatic expiration based on time.
Configure TTL at Namespace or Topic Level:
If TTL settings are applied at the namespace or topic level, ensure that the TTL value is greater than the retention policy set for the topic or namespace, as in the sketch below.
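For example, a combination that satisfies this rule might look like the following, assuming the public/default namespace; here retention is two days and the TTL is three days (259200 seconds):

# Keep acknowledged data for 2 days / up to 10 GB, and expire unacknowledged messages only after 3 days
bin/pulsar-admin namespaces set-retention public/default --time 2d --size 10G
bin/pulsar-admin namespaces set-message-ttl public/default --messageTTL 259200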
By aligning TTL configurations with retention policies in this way, you can manage the data lifecycle and retention within Lakehouse Storage while safeguarding against unintended data loss.