
Lakehouse Storage Overview

Note

This feature is currently in private preview. If you want to try it out or have any questions, submit a ticket to the support team.

Introduction

Lakehouse Storage is part of the Ursa Engine and integrates with Pulsar brokers. It writes streaming data to lakehouse products such as Delta Lake and Apache Iceberg using open table formats. Data stored this way can be read both as a stream and in batch, so the same data serves real-time and analytical workloads.

Figure: Lakehouse Storage

Integration with Apache Pulsar and Lakehouse Ecosystems

Lakehouse Storage works with Apache Pulsar and the broader lakehouse ecosystem: data flows from Pulsar topics into lakehouse tables, where Pulsar clients and external query engines can read it.

Figure: Lakehouse Storage Ecosystem

Architecture Overview

The Lakehouse Storage architecture is designed for efficient data management and retrieval, with performance and scalability as its main goals.

Figure: Lakehouse Storage Architecture

Key Capabilities

1. Writing Data to Lakehouse

Write data from Pulsar topics to lakehouse products, including Delta Lake and Apache Iceberg, in real time, while staying compatible with open lakehouse table formats.

2. Streaming Read Capabilities

Read Lakehouse tables as streams using Pulsar clients, so applications get timely access to real-time data.

3. Batch Read Capabilities

Run batch reads against lakehouse products through query engines such as Spark SQL, Flink SQL, and Trino for analytics and offline processing.
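
The three capabilities above can be pictured with a small in-memory model. This is only a conceptual sketch, not the Ursa Engine implementation: the class `ToyLakehouseTopic`, its methods, and the size-based offload trigger `wal_capacity` are all hypothetical names invented for illustration. Real offloading is governed by the retention and offload policies you configure.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class ToyLakehouseTopic:
    """Illustrative model only; every name here is hypothetical."""

    wal_capacity: int = 3  # assumed offload trigger: max entries kept in the hot tier
    wal: List[Tuple[int, dict]] = field(default_factory=list)    # stands in for the WAL
    table: List[Tuple[int, dict]] = field(default_factory=list)  # stands in for a lakehouse table
    next_offset: int = 0

    def publish(self, record: dict) -> int:
        """Append to the WAL, then move the oldest entries to the table if over capacity."""
        offset = self.next_offset
        self.wal.append((offset, record))
        self.next_offset += 1
        while len(self.wal) > self.wal_capacity:
            self.table.append(self.wal.pop(0))
        return offset

    def stream_read(self, start_offset: int = 0) -> Iterator[Tuple[int, dict]]:
        """Streaming read: replay records in offset order across both tiers."""
        for offset, record in self.table + self.wal:
            if offset >= start_offset:
                yield offset, record

    def batch_scan(self) -> List[dict]:
        """Batch read: scan the rows that have already landed in the table."""
        return [record for _, record in self.table]


topic = ToyLakehouseTopic(wal_capacity=2)
for i in range(5):
    topic.publish({"id": i})

streamed = [offset for offset, _ in topic.stream_read(start_offset=1)]
scanned = topic.batch_scan()
print(streamed)  # [1, 2, 3, 4]
print(scanned)   # [{'id': 0}, {'id': 1}, {'id': 2}]
```

In this sketch a streaming reader sees one continuous sequence of offsets regardless of which tier holds a record, while a batch scan works only on table rows, which is roughly the split between Pulsar clients and query engines described above.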

Benefits of Lakehouse Storage

  • Long-Term Data Retention: Define retention and offload policies that keep data in a latency-optimized WAL (currently BookKeeper) for real-time processing and in lakehouse products for analytical processing, so data stays available long term.
  • Cost-Effective Storage: Store data streams in lakehouse table formats on lakehouse products, which offers a cost-effective storage option.
  • Unified Data Platform: Pulsar serves as a shared storage platform for both real-time and batch data processing.
  • Schema Evolution Management: Lakehouse Storage handles schema evolution and keeps the schemas of Pulsar topics and Lakehouse tables in sync.
  • Advanced Data Management Features: Use data versioning, auditing, indexing, caching, and query optimization, combining the real-time capabilities of data streams with the advantages of data lakehouses.
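
To make the schema evolution point concrete, here is a minimal sketch of what keeping a table schema in sync with a topic schema can look like. It assumes an additive-only policy (new topic fields become new table columns, type changes are rejected). That policy, the function `evolve_table_schema`, and the dictionary-based schema representation are assumptions made for illustration; this page does not specify which schema changes Lakehouse Storage supports.

```python
from typing import Dict, List, Tuple


def evolve_table_schema(
    table_schema: Dict[str, str],
    topic_schema: Dict[str, str],
) -> Tuple[Dict[str, str], List[str]]:
    """Return the evolved table schema and the names of the added columns.

    Hypothetical additive-only rule: unknown fields are appended as new
    columns, and a type change on an existing column raises ValueError.
    """
    evolved = dict(table_schema)  # copy so the caller's schema is not mutated
    added: List[str] = []
    for name, type_name in topic_schema.items():
        if name not in evolved:
            evolved[name] = type_name
            added.append(name)
        elif evolved[name] != type_name:
            raise ValueError(
                f"incompatible type change for {name!r}: "
                f"{evolved[name]} -> {type_name}"
            )
    return evolved, added


table_v1 = {"id": "long", "name": "string"}
topic_v2 = {"id": "long", "name": "string", "email": "string"}

table_v2, added_columns = evolve_table_schema(table_v1, topic_v2)
print(added_columns)  # ['email']
```

Rows written before the change would simply have no value for the new column, which is how additive evolution usually behaves in open table formats.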
