Lakehouse Tiered Storage Overview

Note

This feature is currently in private preview. If you want to try it out or have any questions, submit a ticket to the support team.

Introduction

Lakehouse tiered storage lets Apache Pulsar act as a Lakehouse. It integrates with Pulsar brokers to offload streaming data to Lakehouse products such as Delta Lake, storing the data in open formats. Offloaded data can then be read from the Lakehouse both as a stream and in batch.

Figure: Lakehouse tiered storage

Integration with Apache Pulsar and Lakehouse Ecosystems

Lakehouse tiered storage connects Apache Pulsar with the wider Lakehouse ecosystem, moving data between Pulsar topics and Lakehouse tables.

Figure: Lakehouse tiered storage ecosystem

Architecture Overview

The architecture of Lakehouse tiered storage is designed for efficient data management and retrieval, with performance and scalability in mind.

Figure: Lakehouse tiered storage architecture

Key Features

1. Data Offloading to Lakehouse

Offload data from Pulsar topics to Lakehouse products such as Delta Lake in real time, while keeping the data in open formats.

2. Streaming Read Capabilities

Read data from Lakehouse tables as a stream using Pulsar clients, giving applications timely access to the data.
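
As a rough illustration of the streaming path, the sketch below reads a topic from its earliest available message with the Pulsar Python client (`pulsar-client`). It assumes that offloaded data is served through the normal reader API, as the description above suggests. The service URL and topic name are placeholders, and `drain` is a hypothetical helper, not part of any Pulsar API.

```python
def drain(reader, limit=10):
    """Read up to `limit` messages from a reader and return their payloads.

    Accepts any object exposing has_message_available() and read_next(),
    which is the interface offered by pulsar.Reader.
    """
    payloads = []
    while len(payloads) < limit and reader.has_message_available():
        msg = reader.read_next(timeout_millis=5000)
        payloads.append(msg.data())
    return payloads


def main():
    import pulsar  # pip install pulsar-client

    # Placeholder values: replace with your own service URL and topic.
    client = pulsar.Client("pulsar://localhost:6650")
    reader = client.create_reader(
        "persistent://public/default/my-topic",
        pulsar.MessageId.earliest,  # start from the oldest retained message
    )
    try:
        for payload in drain(reader):
            print(payload)
    finally:
        client.close()
```

Call `main()` against a running cluster. Keeping the read loop in `drain` separates it from connection setup, so it can be exercised with a stub reader.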

3. Batch Read Functionality

Run batch reads against Lakehouse products through query engines such as Spark SQL, Flink SQL, and Trino for analytics and batch processing.
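
For the batch path, a minimal sketch with PySpark might look like the following. It assumes the topic has been offloaded as a Delta Lake table at a storage path you already know, and that Spark is launched with the Delta Lake package available. The path is a placeholder and `delta_query` is a hypothetical helper. This overview does not describe how topics map to table locations.

```python
def delta_query(table_path, limit=10):
    """Build a Spark SQL query that reads a Delta table directly by path."""
    if "`" in table_path:
        raise ValueError("table path must not contain backticks")
    return f"SELECT * FROM delta.`{table_path}` LIMIT {int(limit)}"


def main():
    from pyspark.sql import SparkSession  # pip install pyspark delta-spark

    spark = (
        SparkSession.builder.appName("lakehouse-batch-read")
        # Standard Delta Lake settings for a Spark session.
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config(
            "spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        )
        .getOrCreate()
    )
    # Placeholder path: replace with the location of the offloaded table.
    spark.sql(delta_query("s3a://my-bucket/lakehouse/my-topic")).show()
    spark.stop()
```

Call `main()` in an environment where Spark can reach the table's storage.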

Benefits of Lakehouse Tiered Storage

  • Long-Term Data Retention: Define offload policies that keep data in BookKeeper for real-time processing and in Lakehouse products for batch processing.
  • Cost-Effective Storage: Store cold data in Lakehouse products using open formats and compression to reduce storage costs.
  • Unified Data Platform: Pulsar serves as a single storage and processing platform for both real-time and batch workloads.
  • Schema Evolution Management: Lakehouse tiered storage handles schema evolution and keeps Pulsar topics and Lakehouse tables in sync.
  • Data Query and Analysis: Query data directly in Lakehouse products, and use Pulsar consumers and readers to access data from both BookKeeper and Lakehouse products.
  • Advanced Data Management Features: Get data versioning, auditing, indexing, caching, and query optimization, combining the advantages of data lakes and data warehouses.
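
The idea of an offload policy that splits data between BookKeeper and the Lakehouse can be pictured with a purely conceptual model. The sketch below is not the product's API or configuration: `Segment`, `assign_tier`, `plan`, and the age-based threshold are all hypothetical, chosen only to show how a policy might decide which tier holds which data.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_id: int
    age_seconds: int


def assign_tier(segment, offload_after_seconds):
    """Return the tier holding a segment under a hypothetical age-based policy."""
    if segment.age_seconds >= offload_after_seconds:
        return "lakehouse"
    return "bookkeeper"


def plan(segments, offload_after_seconds):
    """Group segment IDs by the tier the policy assigns them to."""
    tiers = {"bookkeeper": [], "lakehouse": []}
    for seg in segments:
        tiers[assign_tier(seg, offload_after_seconds)].append(seg.segment_id)
    return tiers
```

With a one-hour threshold, `plan([Segment(1, 7200), Segment(2, 600)], 3600)` returns `{"bookkeeper": [2], "lakehouse": [1]}`: the recent segment stays in BookKeeper for real-time reads, while the older one sits in the Lakehouse for batch processing.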
