Lakehouse Table Overview - StreamNative Documentation

Lakehouse Table is a zero-ETL integration that automatically converts streaming data from Apache Pulsar topics into open table formats — Apache Iceberg and Delta Lake — stored directly on object storage (AWS S3, GCS, Azure Blob Storage). This enables unified streaming and analytics access to the same data without building or maintaining separate data pipelines.

Architecture

                          ┌──────────────────┐
                          │   Pulsar Broker  │
                          └────────┬─────────┘
                                   │
                   ┌───────────────┴────────────────┐
                   │  WAL Storage                   │
                   │  ┌──────────────────────────┐  │
                   │  │ Latency-Optimized:       │  │
                   │  │   Apache BookKeeper      │  │
                   │  ├──────────────────────────┤  │
                   │  │ Cost-Optimized:          │  │
                   │  │   Object Storage         │  │
                   │  │   (S3 / GCS / Azure)     │  │
                   │  └──────────────────────────┘  │
                   └───────────────┬────────────────┘
                                   │ Reads from both
                                   ▼
                    ┌─────────────────────────────┐
                    │    Compaction Service       │
                    │  (WAL → Parquet conversion) │
                    └────────────┬────────────────┘
                                 │ Commit
                                 ▼
                    ┌─────────────────────────────┐
                    │    Lakehouse Table          │
                    │  (Iceberg / Delta Lake)     │
                    └─────────────────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    ▼                         ▼
             External Catalog          Query Engines
          (Unity Catalog, S3Table,   (Spark, Trino,
           BigLake, Snowflake)        DuckDB, Athena)

WAL Storage Options

Lakehouse Table supports two WAL storage tiers:

Latency-optimized (Apache BookKeeper): Low-latency writes for performance-sensitive workloads
Cost-optimized (Object Storage): Direct writes to AWS S3, GCS, or Azure Blob Storage for cost efficiency

The Compaction Service reads from both BookKeeper and Object Storage, converts the data to Parquet format, and commits snapshots to the lakehouse catalog.

Coordination

Oxia serves as the metadata store for coordination, leader election, schema storage, and offset index management. The Compaction Service operates with a leader-worker architecture: the leader publishes compaction tasks and commits results to the lakehouse catalog, while workers perform the WAL-to-Parquet conversion.
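
The leader-worker split described above can be pictured with a small, purely illustrative sketch. Everything in it is an assumption made for the example: the names `CompactionTask`, `plan_tasks`, `convert`, and `run_round`, the fixed-size offset batching, and the use of a local thread pool. The real service coordinates through Oxia and writes actual Parquet files, neither of which is modeled here.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class CompactionTask:
    """Hypothetical unit of work: one contiguous WAL offset range."""
    topic: str
    start_offset: int  # inclusive
    end_offset: int    # exclusive


def plan_tasks(topic: str, start: int, end: int, batch: int) -> list[CompactionTask]:
    """Leader role: split a WAL offset range into fixed-size tasks."""
    return [
        CompactionTask(topic, lo, min(lo + batch, end))
        for lo in range(start, end, batch)
    ]


def convert(task: CompactionTask) -> str:
    """Worker role: stand-in for WAL-to-Parquet conversion.

    Returns a made-up file name instead of writing a real Parquet file.
    """
    return f"{task.topic}-{task.start_offset}-{task.end_offset}.parquet"


def run_round(topic: str, start: int, end: int, batch: int = 100, workers: int = 4) -> dict:
    """Leader publishes tasks, workers convert, leader commits one snapshot."""
    tasks = plan_tasks(topic, start, end, batch)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        files = list(pool.map(convert, tasks))  # map() preserves task order
    # Leader role: a single commit covering every file produced in this round.
    return {"topic": topic, "files": files, "end_offset": end}
```

Keeping the commit in one place mirrors the text: workers only produce files, and the leader alone decides when a snapshot becomes visible in the catalog.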

Table Modes

External Table (SDT — Stream Delivered Table)

An External Table delivers data from Pulsar topics into an external lakehouse catalog (such as Databricks Unity Catalog, Snowflake, AWS S3Table, or Google BigLake). The data is written as a separate copy, so the Pulsar topic and the Lakehouse table hold independent copies of the same records. In this mode:

Data is written to Iceberg or Delta Lake tables managed by the external catalog as a separate copy from the Pulsar topic
Analytical access via standard table APIs (Spark, Trino, DuckDB, Athena, etc.)
Supports upsert, partition key, and schema evolution
The external catalog governs data lifecycle (retention, deletion) for the Lakehouse copy independently of the Pulsar topic
Streaming reads with offset semantics are not supported on the delivered data

Use External Tables when: you want to deliver streaming data into curated lakehouse tables for analytics, integrate with existing data platforms, or need upsert/deduplication capabilities.

Internal Table (SBT — Stream Backed Table)

Coming Soon — Internal Table support is under active development.

An Internal Table is managed entirely by Ursa Storage. The Pulsar topic and the Lakehouse table share the same single copy of data — there is no separate write to the Lakehouse table. The same physical data supports both streaming reads (with offset tracking and replay) and analytical queries — true stream-table duality with zero data duplication.

Supported Cluster Profiles and Protocols

StreamNative Private Cloud offers two cluster types (Pulsar and Kafka), each with two performance profiles (latency-optimized and cost-optimized). Lakehouse delivery support depends on the combination of cluster type, profile, and producer protocol.

Cluster Type	Profile	Producer Protocol	Lakehouse Delivery
Pulsar	Latency-optimized	Pulsar	Supported
Pulsar	Latency-optimized	Kafka	Coming Soon
Pulsar	Cost-optimized	Kafka	Supported
Pulsar	Cost-optimized	Pulsar	Not yet supported
Kafka	Cost-optimized	Kafka	Supported
Kafka	Latency-optimized	Kafka	Coming Soon

Notes:

A Pulsar latency-optimized cluster uses Apache BookKeeper as the WAL tier. Topic data produced via the Pulsar protocol can be delivered to Lakehouse today; Kafka-protocol delivery is on the roadmap.
A Pulsar cost-optimized cluster uses object storage as the WAL tier. Topic data produced via the Kafka protocol is delivered to Lakehouse; the Pulsar protocol is not yet supported on this profile.
A Kafka cost-optimized cluster delivers Kafka topic data to Lakehouse today.
A Kafka latency-optimized cluster will support Lakehouse delivery in a future release.
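
As a convenience, the support matrix above can be encoded as a lookup, for instance in a pre-flight check before enabling delivery. This is a minimal sketch: the statuses are copied from the table, while the names `DELIVERY_SUPPORT` and `delivery_status` and the lowercase key format are assumptions made for the example, not part of any StreamNative API.

```python
# Keys are (cluster type, profile, producer protocol), lowercased.
# Values are the statuses from the support table.
DELIVERY_SUPPORT = {
    ("pulsar", "latency-optimized", "pulsar"): "Supported",
    ("pulsar", "latency-optimized", "kafka"): "Coming Soon",
    ("pulsar", "cost-optimized", "kafka"): "Supported",
    ("pulsar", "cost-optimized", "pulsar"): "Not yet supported",
    ("kafka", "cost-optimized", "kafka"): "Supported",
    ("kafka", "latency-optimized", "kafka"): "Coming Soon",
}


def delivery_status(cluster_type: str, profile: str, protocol: str) -> str:
    """Return the documented status, or 'Not listed' for other combinations."""
    key = (cluster_type.lower(), profile.lower(), protocol.lower())
    return DELIVERY_SUPPORT.get(key, "Not listed")
```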

Supported Formats

Format	Status
Apache Iceberg	Supported
Delta Lake	Supported

Supported Cloud Storage

Provider	WAL Storage	Lakehouse Table
AWS S3	Supported	Supported
Google Cloud Storage	Supported	Supported
Azure Blob Storage	Supported	Supported

Supported Catalogs

Catalog support varies by cloud provider:

Catalog	Table Format	AWS	GCP	Azure
Databricks Unity Catalog (Managed Iceberg)	Iceberg	Supported	Supported	Supported
Databricks Unity Catalog (Delta Lake)	Delta Lake	Supported	Supported	Supported
Snowflake Horizon Catalog	Iceberg	Supported	Supported	Supported
Snowflake Open Catalog (Polaris)	Iceberg	Supported	Supported	Supported
AWS S3Table	Iceberg	Supported	—	—
Google BigLake	Iceberg	—	Supported	—

Topic to Lakehouse Identifier Mapping

When data is delivered from a Pulsar topic to a lakehouse table, the topic’s tenant, namespace, and topic name are mapped to a catalog namespace and a table name. The mapping rules differ by catalog type because Pulsar allows characters (/, ., -, :) that are not valid in many catalog identifiers. The compaction service applies the following rules.

Iceberg with hierarchical catalogs (Snowflake Open Catalog, Snowflake Horizon, Iceberg REST, Iceberg Hadoop)

The original Pulsar identifiers are used unchanged:

Catalog namespace: <tenant>.<namespace> (two-level)
Table name: the topic local name (the part after the namespace)

For example, the topic persistent://my-tenant/my-namespace/orders is mapped to namespace my-tenant.my-namespace and table orders.
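
The rule above can be reproduced in a few lines, which may be useful when you need to predict where a topic will land. This is a sketch only: `TopicName`, `parse_topic`, and `hierarchical_identifier` are hypothetical helper names, and the parsing assumes the usual `<domain>://<tenant>/<namespace>/<topic>` form.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    tenant: str
    namespace: str
    local_name: str


def parse_topic(topic: str) -> TopicName:
    """Split a topic such as persistent://tenant/namespace/name into parts."""
    # Drop the domain prefix ("persistent://" or "non-persistent://") if present.
    _, sep, rest = topic.partition("://")
    path = rest if sep else topic
    parts = path.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <tenant>/<namespace>/<topic>, got {topic!r}")
    return TopicName(*parts)


def hierarchical_identifier(topic: str) -> tuple[str, str]:
    """Return (catalog namespace, table name) with identifiers left unchanged."""
    name = parse_topic(topic)
    return f"{name.tenant}.{name.namespace}", name.local_name
```

Calling `hierarchical_identifier("persistent://my-tenant/my-namespace/orders")` returns the namespace and table from the example above.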

Iceberg with flat-namespace catalogs (AWS S3Tables, Google BigLake, Hive)

These catalogs accept only a single-level namespace, so the tenant and namespace are flattened into one identifier with a cluster-name prefix, joined by single underscores. Each component is escaped to replace characters the catalog does not allow:

Source character	Replacement
`/`	`___` (three underscores)
`.`	`_` (one underscore)
`-`	`__` (two underscores)
`:`	`____` (four underscores)

Catalog namespace: <cluster>_<formatted-tenant>_<formatted-namespace> (default cluster prefix is pulsar; configurable via the cluster property in the compaction service)
Table name (S3Tables): the topic local name with the same character escapes applied
Table name (BigLake, Hive): the topic local name as-is

For example, with the default cluster prefix pulsar, topic persistent://public-v1/default.v2/test-table-v1:

On AWS S3Tables: namespace pulsar_public__v1_default_v2, table test__table__v1
On Google BigLake: namespace pulsar_public__v1_default_v2, table test-table-v1
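
The escaping rules and both examples above can be checked with a short sketch. The escape table and the namespace layout come from this section. The function names, the `catalog` labels (`"s3tables"`, `"biglake"`, `"hive"`), and the choice to take the tenant, namespace, and topic local name as separate arguments are assumptions made for illustration.

```python
# Escapes for flat-namespace catalogs, taken from the table above.
FLAT_ESCAPES = str.maketrans({"/": "___", ".": "_", "-": "__", ":": "____"})


def escape(component: str) -> str:
    """Apply the flat-namespace character escapes in a single pass."""
    return component.translate(FLAT_ESCAPES)


def flat_identifier(
    tenant: str,
    namespace: str,
    local_name: str,
    catalog: str,
    cluster: str = "pulsar",
) -> tuple[str, str]:
    """Return (catalog namespace, table name) for a flat-namespace catalog.

    `catalog` is a hypothetical label: "s3tables", "biglake", or "hive".
    """
    if catalog not in {"s3tables", "biglake", "hive"}:
        raise ValueError(f"unknown catalog label: {catalog!r}")
    catalog_namespace = f"{cluster}_{escape(tenant)}_{escape(namespace)}"
    # Only S3Tables escapes the table name; BigLake and Hive keep it as-is.
    table = escape(local_name) if catalog == "s3tables" else local_name
    return catalog_namespace, table
```

Because every replacement consists solely of underscores, no escape can produce a character that another rule would rewrite, so a single-pass translation is sufficient.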

Databricks Unity Catalog (Iceberg or Delta Lake)

Unity Catalog uses a three-level identifier (catalog.schema.table). The compaction service writes all topics into a single schema and encodes the full Pulsar topic path into the table name, so each catalog table maps 1:1 to a Pulsar topic. The full topic path <tenant>/<namespace>/<topic> is flattened with these escapes:

Source character	Replacement
`/`	`__` (two underscores)
`.`	`____` (four underscores)
`-`	`___` (three underscores)

For example, topic persistent://public/default/test-topic is mapped to table name public__default__test___topic. Topic persistent://public/default/v1.events is mapped to public__default__v1____events.
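
A minimal sketch of the Unity Catalog table-name encoding follows, using the escape table from this section. The name `unity_table_name` is hypothetical, and the sketch assumes the domain prefix (for example `persistent://`) is dropped before escaping, which is consistent with the examples above.

```python
# Escapes for Databricks Unity Catalog table names, taken from the table above.
UNITY_ESCAPES = str.maketrans({"/": "__", ".": "____", "-": "___"})


def unity_table_name(topic: str) -> str:
    """Flatten <tenant>/<namespace>/<topic> into a single table name."""
    # Drop the domain prefix ("persistent://" or "non-persistent://") if present.
    _, sep, rest = topic.partition("://")
    path = rest if sep else topic
    return path.translate(UNITY_ESCAPES)
```

Calling it on the two topics above yields the table names shown in the examples.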

Next Steps

Deploy Lakehouse Table — Set up the infrastructure with private-cloud.yaml
Prepare Lakehouse Catalogs — Set up your external catalog service
Dynamic Configuration Guide — Cluster-name prefix, override priority, and the full set of dynamic configuration keys
Configure Lakehouse Catalogs — Connect catalogs to the compaction service
Enable Lakehouse Integration — Enable at cluster, namespace, or topic level
Features: Observability (metrics, alerts, and Grafana dashboard)
