Lakehouse Tiered Storage Overview
Note
This feature is currently in private preview. If you want to try it out or have any questions, submit a ticket to the support team.
Introduction
Lakehouse tiered storage lets Apache Pulsar act as a Lakehouse. It integrates with Pulsar brokers and offloads streaming data to popular Lakehouse products such as Delta Lake, storing the data in open formats. Once offloaded, the data can be read from the Lakehouse both as a stream and in batch.
Integration with Apache Pulsar and Lakehouse Ecosystems
Lakehouse tiered storage works with both Apache Pulsar and the broader Lakehouse ecosystem: data moves from Pulsar topics into Lakehouse tables and can be read back through either side.
Architecture Overview
The architecture of Lakehouse tiered storage is designed for efficient data management and retrieval while remaining performant and scalable.
Key Features
1. Data Offloading to Lakehouse
Offload data from Pulsar topics to Lakehouse products such as Delta Lake in real time, while keeping the data in open formats.
2. Streaming Read Capabilities
Read Lakehouse tables as a stream using Pulsar clients, so applications keep timely access to data after it has been offloaded.
3. Batch Read Functionality
Run batch reads against Lakehouse products through popular query engines such as Spark SQL, Flink SQL, and Trino for analytics and data processing.
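The three features above can be pictured with a small conceptual model. The following Python sketch is illustrative only and is not the product's implementation or API: the `TieredTopic` class, its method names, and the `keep_hot` parameter are all hypothetical. It assumes that offloading moves the oldest messages from a hot tier (standing in for BookKeeper) to a cold tier (standing in for a Lakehouse table), that a streaming read sees both tiers as one ordered stream, and that a batch read scans only the cold tier.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Message:
    offset: int   # position in the topic; stands in for a Pulsar message ID
    payload: str


@dataclass
class TieredTopic:
    """Toy model (hypothetical): a hot tier plus a cold tier for one topic."""
    hot: List[Message] = field(default_factory=list)   # stands in for BookKeeper
    cold: List[Message] = field(default_factory=list)  # stands in for a Lakehouse table

    def publish(self, payload: str) -> Message:
        # New messages always land in the hot tier first.
        msg = Message(offset=len(self.hot) + len(self.cold), payload=payload)
        self.hot.append(msg)
        return msg

    def offload(self, keep_hot: int) -> int:
        """Move all but the newest `keep_hot` messages to the cold tier."""
        cut = max(len(self.hot) - keep_hot, 0)
        moved, self.hot = self.hot[:cut], self.hot[cut:]
        self.cold.extend(moved)
        return len(moved)

    def stream_read(self, start_offset: int = 0) -> Iterator[Message]:
        """Streaming read: cold tier first, then hot tier, in offset order."""
        for msg in self.cold + self.hot:
            if msg.offset >= start_offset:
                yield msg

    def batch_read(self) -> List[str]:
        """Batch read: a query engine would scan only the offloaded table."""
        return [msg.payload for msg in self.cold]


topic = TieredTopic()
for i in range(5):
    topic.publish(f"event-{i}")

moved = topic.offload(keep_hot=2)                     # oldest three go cold
streamed = [m.payload for m in topic.stream_read()]   # spans both tiers
```

In this model a reader that starts at the earliest offset receives every message in order regardless of which tier holds it, while `batch_read()` returns only what has already been offloaded.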
Benefits of Lakehouse Tiered Storage
- Long-Term Data Retention: Define offload policies that keep data in BookKeeper for real-time processing and in Lakehouse products for batch processing, so data can be retained over the long term.
- Cost-Effective Storage: Store cold data in Lakehouse products using open formats and compression to reduce storage cost.
- Unified Data Platform: Use Pulsar as a single data storage and processing platform for both real-time and batch workloads.
- Schema Evolution Management: Lakehouse tiered storage handles schema evolution and keeps Pulsar topic schemas and Lakehouse table schemas in sync.
- Data Query and Analysis: Query data directly in Lakehouse products, and use Pulsar consumers and readers to access data from both BookKeeper and Lakehouse products.
- Advanced Data Management Features: Use data versioning, auditing, indexing, caching, and query optimization, combining the advantages of data lakes and data warehouses.
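To make the schema evolution point more concrete, here is a minimal Python sketch of one way a topic schema change could be carried over to a table schema. It is an assumption for illustration, not the documented behavior of Lakehouse tiered storage: the function `evolve_table_schema` is hypothetical, schemas are modeled as plain name-to-type dictionaries, and the policy shown (new fields are appended as columns, existing columns are never dropped, and type changes are rejected) is only one possible rule set.

```python
from typing import Dict


def evolve_table_schema(table: Dict[str, str], topic: Dict[str, str]) -> Dict[str, str]:
    """Return the table schema after applying an additive-only topic schema change.

    Hypothetical policy: columns missing from the new topic schema are kept,
    new fields are appended, and a changed type raises ValueError.
    """
    for name, col_type in topic.items():
        if name in table and table[name] != col_type:
            raise ValueError(f"incompatible type change for column {name!r}")
    merged = dict(table)                    # keep every existing column
    for name, col_type in topic.items():
        merged.setdefault(name, col_type)   # append fields the table lacks
    return merged


table_v1 = {"id": "long", "name": "string"}
topic_v2 = {"id": "long", "name": "string", "email": "string"}
table_v2 = evolve_table_schema(table_v1, topic_v2)
```

Under this assumed policy, rows written before the change would simply have no value in the new `email` column, and a producer that changes the type of `id` would be rejected rather than silently rewriting the table.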