
Understand Lakehouse Tiered Storage

Note

This feature is currently in private preview. If you want to try it out or have any questions, submit a ticket to the support team.

Overview

Lakehouse, pioneered by Databricks, represents a data management system that leverages cost-effective, directly-accessible storage while incorporating traditional analytical DBMS features like ACID transactions, data versioning, auditing, indexing, caching, and query optimization. It combines the advantages of data lakes and data warehouses, offering low-cost storage in an open format with robust management and optimization capabilities.

Our objective is to transform Pulsar into a Lakehouse by seamlessly integrating it with leading Lakehouse products such as Delta Lake. This integration aims to establish Pulsar as a unified platform for data storage and processing, catering to both real-time and batch data processing needs.

Lakehouse Tiered Storage

There are two primary methods to integrate Pulsar with Lakehouse products: using the Pulsar lakehouse sink connector to sink data to Lakehouse tables or utilizing Lakehouse products as a cold storage layer for Pulsar. This document focuses on the latter approach.

Lakehouse Tiered Storage is a feature that enables Pulsar to store warm data in BookKeeper and cold data in Lakehouse products concurrently. It allows for data storage in Pulsar for real-time processing and in Lakehouse products for batch processing. Users can define an offload policy to determine which data is stored in BookKeeper and which is stored in Lakehouse products, enabling long-term data retention and batch processing support.

Lakehouse Tiered Storage Ecosystem

Mapping Pulsar Topics to Lakehouse Tables

Pulsar topics are mapped to Lakehouse tables based on their characteristics:

  • Partitioned topics map to Lakehouse tables with multiple partitions, mirroring the topic's partition structure.
  • Non-partitioned topics map to Lakehouse tables with a single partition.
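The mapping above can be sketched as follows. This is an illustrative Python sketch, not the actual implementation; the function name and table-naming convention are assumptions.

```python
# Hypothetical sketch of mapping a Pulsar topic to a Lakehouse table and
# its partitions. Partitioned topics yield one table partition per topic
# partition; non-partitioned topics yield a single partition.

def topic_to_table_partitions(topic: str, num_partitions: int) -> dict:
    # Assume the local topic name (after the last '/') becomes the table name.
    table = topic.rsplit("/", 1)[-1]
    if num_partitions > 0:
        # Partitioned topic: mirror the topic's partition structure.
        partitions = [f"{topic}-partition-{i}" for i in range(num_partitions)]
    else:
        # Non-partitioned topic: a single table partition.
        partitions = [topic]
    return {"table": table, "partitions": partitions}

mapping = topic_to_table_partitions("persistent://public/default/orders", 3)
```

Here a three-partition topic `orders` maps to a table named `orders` with three partitions, one per topic partition.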

Pulsar Topics Mapped to Lakehouse Tables

Schema Evolution

Schema evolution allows the schema of a topic to evolve over time, necessitating corresponding updates in the Lakehouse table schema. Lakehouse Tiered Storage manages schema evolution seamlessly, ensuring synchronization between Pulsar topics and Lakehouse tables.
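The synchronization described above can be sketched as an additive schema merge. This is a simplified illustration under the assumption that evolution only adds fields; the function and type names are hypothetical, not the product's API.

```python
# Illustrative sketch of additive schema evolution: when the topic schema
# gains a field, the Lakehouse table schema is extended to match. Existing
# fields must keep their types; an incompatible change is rejected.

def evolve_table_schema(table_schema: dict, topic_schema: dict) -> dict:
    evolved = dict(table_schema)
    for field, ftype in topic_schema.items():
        if field in evolved and evolved[field] != ftype:
            raise ValueError(f"incompatible type change for {field!r}")
        # New fields are appended; existing fields are left untouched.
        evolved.setdefault(field, ftype)
    return evolved

v1 = {"id": "long", "amount": "double"}
v2 = {"id": "long", "amount": "double", "currency": "string"}
merged = evolve_table_schema(v1, v2)
```

After the topic schema evolves from `v1` to `v2`, the table schema picks up the new `currency` field while `id` and `amount` are unchanged.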

Lakehouse Tiered Storage Schema Evolution

Tiered Storage Policy

The tiered storage policy, controlled by the offload threshold at the namespace level, determines data storage allocation between BookKeeper and Lakehouse products. The offload threshold can be set based on size or time, enabling users to manage data storage efficiently.

  --size, -s
      Maximum number of bytes stored in the Pulsar cluster for a topic
      before data will start being automatically offloaded to long-term
      storage (eg: 10M, 16G, 3T, 100). -1 falls back to the cluster's
      namespace default. Negative values disable automatic offload. 0
      triggers offloading as soon as possible.
      Default: -1
  --time, -t
      Maximum number of seconds stored on the Pulsar cluster for a topic
      before the broker will start offloading to long-term storage (eg:
      10m, 5h, 3d, 2w).
      Default: -1
  • If the time-based and size-based offload thresholds are both set to -1, tiered storage is disabled and all data is stored in BookKeeper.
  • If the time-based or size-based offload threshold is set to a non-negative number, tiered storage is enabled, and data that exceeds the size threshold or is older than the time threshold is offloaded to Lakehouse products.
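The decision rule above can be sketched as a small predicate. This is an illustrative sketch of the documented semantics (-1 disables a dimension, 0 offloads as soon as possible), not the broker's actual code; the function name is hypothetical.

```python
# Sketch of the offload decision: either threshold can trigger offloading.
# A threshold of -1 (or any negative value) disables that dimension; 0
# makes the condition true as soon as any data exists or ages at all.

def should_offload(topic_bytes: int, oldest_age_seconds: int,
                   size_threshold: int, time_threshold: int) -> bool:
    size_trigger = size_threshold >= 0 and topic_bytes > size_threshold
    time_trigger = time_threshold >= 0 and oldest_age_seconds > time_threshold
    return size_trigger or time_trigger
```

For example, with both thresholds set to -1 nothing is ever offloaded, while a size threshold of 0 offloads as soon as the topic holds any data.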

Lakehouse Tiered Storage Data Lifecycle

After data is offloaded to Lakehouse products, its lifecycle can be managed by either Pulsar or the Lakehouse products. The dataManagedByPulsar configuration determines which system applies the retention and deletion policies.

  • If the data lifecycle is managed by Pulsar, the data in Lakehouse products will be deleted when the Pulsar topic data is expired by the retention policy.
  • If the data lifecycle is managed by Lakehouse products, the data in Lakehouse products will not be deleted when the Pulsar topic data is expired by the retention policy. Expired topic data in Lakehouse products can no longer be accessed by Pulsar consumers/readers, but it can still be read by Lakehouse products and deleted through their own data lifecycle management.
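The two branches above can be summarized in a short sketch. This is purely illustrative; the function name and return shape are assumptions, not the product's configuration API.

```python
# Sketch of the dataManagedByPulsar behavior when topic data expires:
# who deletes the offloaded copy, and who can still read it afterwards.

def on_topic_data_expired(data_managed_by_pulsar: bool) -> dict:
    if data_managed_by_pulsar:
        # Pulsar drives the lifecycle: the offloaded copy is removed too.
        return {"delete_from_lakehouse": True, "readable_by_lakehouse": False}
    # Lakehouse drives the lifecycle: the copy stays in the table, hidden
    # from Pulsar consumers/readers but still queryable by Lakehouse tools.
    return {"delete_from_lakehouse": False, "readable_by_lakehouse": True}
```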

How Lakehouse Tiered Storage Works

Lakehouse Tiered Storage is implemented as a tiered storage policy in Pulsar that determines where data is stored. It involves components such as the Message Container, Message Formatter, Lakehouse Writer, and Lakehouse Reader, which together facilitate seamless data offloading and retrieval between Pulsar and Lakehouse products.

Lakehouse Tiered Storage Architecture

In the diagram, the Lakehouse Tiered Storage has four components.

  • Message Container: The Lakehouse tiered storage creates a subscription on the topic and keeps reading new messages from it. Fetched messages are put into the container, and the container is transferred to the formatter when it is full.
  • Message Formatter: The formatter will parse the messages based on the topic schema and convert the messages to Lakehouse format. If topic schema evolves, the formatter will handle the schema evolution and update the Lakehouse table schema at the same time.
  • Lakehouse Writer: The writer writes the formatted messages to Lakehouse products and updates the metadata in Pulsar to indicate that the data is stored there. It supports multiple compression types, such as snappy, lz4, and gzip, to compress the target files. By default, data is written in Parquet format; other formats, such as ORC, will be supported in the future. Currently, the writer supports Delta Lake.
  • Lakehouse Reader: When Pulsar consumers/readers read data from the topic, the reader fetches the records from Lakehouse products and passes them to the Message Formatter, which converts the data back to Pulsar format.
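The write path through these components can be sketched as a small pipeline: a container batches messages, a formatter projects them onto the table schema, and a writer persists each full batch. All class and function names below are hypothetical stand-ins for the actual implementation.

```python
# Minimal sketch of the offload pipeline: container -> formatter -> writer.

class MessageContainer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.messages = []

    def add(self, msg) -> bool:
        """Buffer a message; return True when the container is full."""
        self.messages.append(msg)
        return len(self.messages) >= self.capacity

def format_batch(messages, schema):
    # Stand-in for the Pulsar-to-Lakehouse conversion: project each raw
    # message onto the fields of the table schema.
    return [{field: msg.get(field) for field in schema} for msg in messages]

def offload(messages, schema, capacity=2):
    written = []
    container = MessageContainer(capacity)
    for msg in messages:
        if container.add(msg):
            # Container is full: format the batch and hand it to the writer.
            written.extend(format_batch(container.messages, schema))
            container.messages.clear()
    # Flush any remainder (the real pipeline would also flush on a timer).
    written.extend(format_batch(container.messages, schema))
    return written
```

Running `offload` over three messages with a batch capacity of two produces two writes (a full batch plus a flushed remainder), all rows conforming to the table schema.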

Note: Due to the encoding and decoding requirements of Lakehouse Tiered Storage, brokers need to allocate sufficient CPU and heap memory for efficient message processing.
