# Lakehouse Table Overview

Lakehouse Table is a zero-ETL integration that automatically converts streaming data from Apache Pulsar topics into open table formats -- Apache Iceberg and Delta Lake -- stored directly on object storage (AWS S3, GCS, Azure Blob Storage). This enables unified streaming and analytics access to the same data without building or maintaining separate data pipelines.

## Architecture

```
                          ┌──────────────────┐
                          │   Pulsar Broker  │
                          └────────┬─────────┘
                                   │
                   ┌───────────────┴────────────────┐
                   │  WAL Storage                   │
                   │  ┌──────────────────────────┐  │
                   │  │ Latency-Optimized:       │  │
                   │  │   Apache BookKeeper      │  │
                   │  ├──────────────────────────┤  │
                   │  │ Cost-Optimized:          │  │
                   │  │   Object Storage         │  │
                   │  │   (S3 / GCS / Azure)     │  │
                   │  └──────────────────────────┘  │
                   └───────────────┬────────────────┘
                                   │ Reads from both
                                   ▼
                    ┌─────────────────────────────┐
                    │    Compaction Service       │
                    │  (WAL → Parquet conversion) │
                    └────────────┬────────────────┘
                                 │ Commit
                                 ▼
                    ┌─────────────────────────────┐
                    │    Lakehouse Table          │
                    │  (Iceberg / Delta Lake)     │
                    └─────────────────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    ▼                         ▼
             External Catalog          Query Engines
          (Unity Catalog, S3Table,   (Spark, Trino,
           BigLake, Snowflake)        DuckDB, Athena)
```

### WAL Storage Options

Lakehouse Table supports two write-ahead log (WAL) storage tiers:

* **Latency-optimized (Apache BookKeeper):** Low-latency writes for performance-sensitive workloads
* **Cost-optimized (Object Storage):** Direct writes to AWS S3, GCS, or Azure Blob Storage for cost efficiency

The **Compaction Service reads from both** BookKeeper and Object Storage, converts the data to Parquet format, and commits snapshots to the lakehouse catalog.

### Coordination

**Oxia** serves as the metadata store for coordination, leader election, schema storage, and offset index management.

The Compaction Service operates with a leader-worker architecture: the leader publishes compaction tasks and commits results to the lakehouse catalog, while workers perform the WAL-to-Parquet conversion.
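
The division of labor between leader and workers can be pictured with a small simulation. This is only an illustrative sketch, not the Compaction Service's actual code: `CompactionTask`, `convert_wal_to_parquet`, `run_leader`, the offset ranges, and the file names are all hypothetical, and a thread pool stands in for the worker processes.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CompactionTask:
    """Hypothetical unit of work: one offset range of one topic's WAL."""
    topic: str
    start_offset: int
    end_offset: int


def convert_wal_to_parquet(task: CompactionTask) -> str:
    """Worker role: stand-in for the real WAL-to-Parquet conversion.

    Returns a made-up Parquet file name instead of writing a real file.
    """
    return f"{task.topic}/{task.start_offset}-{task.end_offset}.parquet"


def run_leader(topic: str,
               ranges: List[Tuple[int, int]],
               catalog_commits: List[List[str]],
               workers: int = 2) -> List[str]:
    """Leader role: publish tasks, collect worker results, commit once."""
    tasks = [CompactionTask(topic, start, end) for start, end in ranges]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves task order, so files line up with the ranges.
        files = list(pool.map(convert_wal_to_parquet, tasks))
    # Only the leader appends to the (simulated) catalog: one commit per batch.
    catalog_commits.append(files)
    return files


commits: List[List[str]] = []
run_leader("orders", [(0, 99), (100, 199)], commits)
print(commits)
```

The point of the sketch is that conversion work fans out to workers while the catalog sees a single committer.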

## Table Modes

### External Table (SDT -- Stream Delivered Table)

An External Table delivers data from Pulsar topics into an external lakehouse catalog (such as Databricks Unity Catalog, Snowflake, AWS S3Table, or Google BigLake). A **separate copy** of the topic data is written to the Lakehouse table; the Pulsar topic and the Lakehouse table hold independent copies of the same records. In this mode:

* Data is written to Iceberg or Delta Lake tables managed by the external catalog as a separate copy from the Pulsar topic
* Analytical access is available through standard query engines (Spark, Trino, DuckDB, Athena, etc.)
* Supports **upsert**, **partition key**, and **schema evolution**
* The external catalog governs data lifecycle (retention, deletion) for the Lakehouse copy independently of the Pulsar topic
* Streaming reads with offset semantics are **not** supported on the delivered data

**Use External Tables when:** you want to deliver streaming data into curated lakehouse tables for analytics, integrate with existing data platforms, or need upsert/deduplication capabilities.

### Internal Table (SBT -- Stream Backed Table)

> **Coming Soon** -- Internal Table support is under active development.

An Internal Table is managed entirely by Ursa Storage. The Pulsar topic and the Lakehouse table **share the same single copy of data**; there is no separate write to the Lakehouse table. The same physical data supports both streaming reads (with offset tracking and replay) and analytical queries -- true stream-table duality with zero data duplication.

## Supported Cluster Profiles and Protocols

StreamNative Private Cloud offers two cluster types (Pulsar and Kafka), each with two performance profiles (latency-optimized and cost-optimized). Lakehouse delivery support depends on the combination of cluster type, profile, and producer protocol.

| Cluster Type | Profile           | Producer Protocol | Lakehouse Delivery |
| ------------ | ----------------- | ----------------- | ------------------ |
| Pulsar       | Latency-optimized | Pulsar            | Supported          |
| Pulsar       | Latency-optimized | Kafka             | Coming Soon        |
| Pulsar       | Cost-optimized    | Kafka             | Supported          |
| Pulsar       | Cost-optimized    | Pulsar            | Not yet supported  |
| Kafka        | Cost-optimized    | Kafka             | Supported          |
| Kafka        | Latency-optimized | Kafka             | Coming Soon        |

Notes:

* A **Pulsar latency-optimized** cluster uses Apache BookKeeper as the WAL tier. Topic data produced via the Pulsar protocol can be delivered to Lakehouse today; Kafka-protocol delivery is on the roadmap.
* A **Pulsar cost-optimized** cluster uses object storage as the WAL tier. Topic data produced via the Kafka protocol is delivered to Lakehouse; the Pulsar protocol is not yet supported on this profile.
* A **Kafka cost-optimized** cluster delivers Kafka topic data to Lakehouse today.
* A **Kafka latency-optimized** cluster will support Lakehouse delivery in a future release.
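
For scripted pre-flight checks, the support matrix above can be encoded as a lookup table. The following is a minimal sketch that simply transcribes the table; the names `DELIVERY_SUPPORT` and `lakehouse_delivery_status` are hypothetical and not part of any StreamNative tool or API.

```python
from typing import Dict, Tuple

# (cluster type, profile, producer protocol) -> status, copied from the table.
DELIVERY_SUPPORT: Dict[Tuple[str, str, str], str] = {
    ("pulsar", "latency-optimized", "pulsar"): "Supported",
    ("pulsar", "latency-optimized", "kafka"): "Coming Soon",
    ("pulsar", "cost-optimized", "kafka"): "Supported",
    ("pulsar", "cost-optimized", "pulsar"): "Not yet supported",
    ("kafka", "cost-optimized", "kafka"): "Supported",
    ("kafka", "latency-optimized", "kafka"): "Coming Soon",
}


def lakehouse_delivery_status(cluster_type: str, profile: str, protocol: str) -> str:
    """Return the documented status, or 'Not listed' for unknown combinations."""
    key = (cluster_type.lower(), profile.lower(), protocol.lower())
    return DELIVERY_SUPPORT.get(key, "Not listed")


print(lakehouse_delivery_status("Pulsar", "Cost-optimized", "Kafka"))
```

Combinations that do not appear in the table (for example, a Kafka cluster with Pulsar producers) return `Not listed` rather than a guess.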

## Supported Formats

| Format         | Status    |
| -------------- | --------- |
| Apache Iceberg | Supported |
| Delta Lake     | Supported |

## Supported Cloud Storage

| Provider             | WAL Storage | Lakehouse Table |
| -------------------- | ----------- | --------------- |
| AWS S3               | Supported   | Supported       |
| Google Cloud Storage | Supported   | Supported       |
| Azure Blob Storage   | Supported   | Supported       |

## Supported Catalogs

Catalog support varies by cloud provider:

| Catalog                                    | Table Format | AWS       | GCP       | Azure     |
| ------------------------------------------ | ------------ | --------- | --------- | --------- |
| Databricks Unity Catalog (Managed Iceberg) | Iceberg      | Supported | Supported | Supported |
| Databricks Unity Catalog (Delta Lake)      | Delta Lake   | Supported | Supported | Supported |
| Snowflake Horizon Catalog                  | Iceberg      | Supported | Supported | Supported |
| Snowflake Open Catalog (Polaris)           | Iceberg      | Supported | Supported | Supported |
| AWS S3Table                                | Iceberg      | Supported | --        | --        |
| Google BigLake                             | Iceberg      | --        | Supported | --        |

## Topic to lakehouse identifier mapping

Each Pulsar topic maps to exactly **one** Lakehouse table. The mapping is 1:1 regardless of how many partitions the topic has: data from all partitions of a partitioned topic is consolidated into a single Lakehouse table.

When data is delivered from a Pulsar topic to a lakehouse table, the topic's tenant, namespace, and topic name are mapped to a catalog **namespace** and a **table name**. The mapping rules differ by catalog type because Pulsar allows characters (`/`, `.`, `-`, `:`) that are not valid in many catalog identifiers.

The compaction service applies the following rules.

### Iceberg with hierarchical catalogs (Snowflake Open Catalog, Snowflake Horizon, Iceberg REST, Iceberg Hadoop)

The original Pulsar identifiers are used unchanged:

* Catalog namespace: `<tenant>.<namespace>` (two-level)
* Table name: the topic local name (the part after the namespace)

For example, the topic `persistent://my-tenant/my-namespace/orders` is mapped to namespace `my-tenant.my-namespace` and table `orders`.
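
The hierarchical rule can be sketched in a few lines of Python. This is an illustration of the mapping described above, not the compaction service's implementation; the function names `parse_topic` and `hierarchical_identifier` are hypothetical.

```python
from typing import Tuple


def parse_topic(topic: str) -> Tuple[str, str, str]:
    """Split a Pulsar topic name into (tenant, namespace, local name)."""
    # Drop the optional domain prefix such as 'persistent://'.
    path = topic.split("://", 1)[-1]
    parts = path.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <tenant>/<namespace>/<topic>, got {topic!r}")
    return parts[0], parts[1], parts[2]


def hierarchical_identifier(topic: str) -> Tuple[str, str]:
    """Return (catalog namespace, table name) with identifiers left unchanged."""
    tenant, namespace, local_name = parse_topic(topic)
    return f"{tenant}.{namespace}", local_name


print(hierarchical_identifier("persistent://my-tenant/my-namespace/orders"))
# ('my-tenant.my-namespace', 'orders')
```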

### Iceberg with flat-namespace catalogs (AWS S3Tables, Google BigLake, Hive)

These catalogs accept only a single-level namespace, so the tenant and namespace are flattened into one identifier with a cluster-name prefix. Invalid characters in each component are replaced as follows:

| Source character | Replacement               |
| ---------------- | ------------------------- |
| `/`              | `___` (three underscores) |
| `.`              | `_` (one underscore)      |
| `-`              | `__` (two underscores)    |
| `:`              | `____` (four underscores) |

* Catalog namespace: `<cluster>_<formatted-tenant>_<formatted-namespace>` (default cluster prefix is `pulsar`)
* Table name (S3Tables): the topic local name with the same character escapes applied
* Table name (BigLake, Hive): the topic local name as-is

For example, with the default cluster prefix `pulsar`, topic `persistent://public-v1/default.v2/test-table-v1` is mapped as follows:

* On **AWS S3Tables**: namespace `pulsar_public__v1_default_v2`, table `test__table__v1`
* On **Google BigLake**: namespace `pulsar_public__v1_default_v2`, table `test-table-v1`
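
The flat-namespace rules above can be reproduced with a character translation table. The sketch below only mirrors the documented escapes; `escape_flat`, `flat_identifier`, and the catalog keys `"s3tables"`, `"biglake"`, and `"hive"` are hypothetical names chosen for illustration.

```python
from typing import Tuple

# Character escapes from the table above (flat-namespace Iceberg catalogs).
_FLAT_ESCAPES = str.maketrans({"/": "___", ".": "_", "-": "__", ":": "____"})


def escape_flat(component: str) -> str:
    """Apply the flat-namespace escapes to one identifier component."""
    return component.translate(_FLAT_ESCAPES)


def flat_identifier(topic: str, catalog: str, cluster: str = "pulsar") -> Tuple[str, str]:
    """Return (catalog namespace, table name) for a flat-namespace catalog."""
    if catalog not in ("s3tables", "biglake", "hive"):
        raise ValueError(f"unknown catalog type: {catalog!r}")
    # Drop the optional domain prefix, then split tenant/namespace/local name.
    path = topic.split("://", 1)[-1]
    tenant, namespace, local_name = path.split("/", 2)
    catalog_namespace = f"{cluster}_{escape_flat(tenant)}_{escape_flat(namespace)}"
    # Only S3Tables escapes the table name; BigLake and Hive keep it as-is.
    table = escape_flat(local_name) if catalog == "s3tables" else local_name
    return catalog_namespace, table


topic = "persistent://public-v1/default.v2/test-table-v1"
print(flat_identifier(topic, "s3tables"))
# ('pulsar_public__v1_default_v2', 'test__table__v1')
print(flat_identifier(topic, "biglake"))
# ('pulsar_public__v1_default_v2', 'test-table-v1')
```

A translation table replaces each character in a single pass, so the order in which the escapes are listed does not affect the result.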

### Databricks Unity Catalog (Iceberg or Delta Lake)

Unity Catalog uses a three-level identifier (`catalog.schema.table`). The compaction service writes all topics into a single schema and encodes the full Pulsar topic path into the **table name**, so each catalog table maps 1:1 to a Pulsar topic. The full topic path `<tenant>/<namespace>/<topic>` is flattened with these escapes:

| Source character | Replacement               |
| ---------------- | ------------------------- |
| `/`              | `__` (two underscores)    |
| `.`              | `____` (four underscores) |
| `-`              | `___` (three underscores) |

For example, topic `persistent://public/default/test-topic` is mapped to table name `public__default__test___topic`. Topic `persistent://public/default/v1.events` is mapped to `public__default__v1____events`.
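
The Unity Catalog escapes can be sketched the same way. This is a minimal illustration of the table above, not the compaction service's code; `unity_table_name` is a hypothetical function name.

```python
# Character escapes from the table above (Databricks Unity Catalog).
_UNITY_ESCAPES = str.maketrans({"/": "__", ".": "____", "-": "___"})


def unity_table_name(topic: str) -> str:
    """Flatten <tenant>/<namespace>/<topic> into a single table name."""
    # Drop the optional domain prefix such as 'persistent://'.
    path = topic.split("://", 1)[-1]
    return path.translate(_UNITY_ESCAPES)


print(unity_table_name("persistent://public/default/test-topic"))
# public__default__test___topic
print(unity_table_name("persistent://public/default/v1.events"))
# public__default__v1____events
```

Note that these replacement strings differ from the flat-namespace escapes (for example, `/` becomes two underscores here but three there), so the two tables should not be shared.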

## Limitations

### Schema limitations

The following topic schema constructs are not supported when delivering data to a Lakehouse Table:

* **Recursive schemas:** Schemas that reference themselves -- directly or indirectly -- are not supported. An example is a record with a field whose type is the same record, such as a tree node that holds a list of child nodes of the same type. Iceberg and Delta Lake require a fixed, finite column structure, which cannot represent a self-referential schema. Topics that use a recursive schema cannot be delivered to a Lakehouse Table.
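
A topic schema can be checked for recursion before Lakehouse delivery is enabled. The sketch below assumes the schema is available as an Avro-style JSON definition already parsed into Python objects; that format is an assumption made for this example, and `is_recursive` is a hypothetical helper, not part of any StreamNative tool. It compares record names without namespaces, which is a simplification.

```python
from typing import Any, FrozenSet


def is_recursive(schema: Any, open_names: FrozenSet[str] = frozenset()) -> bool:
    """Return True if a record refers to itself, directly or indirectly.

    `open_names` holds the records whose definitions are still being walked;
    a name reference to any of them means the schema is self-referential.
    """
    if isinstance(schema, str):
        # A bare string is a primitive or a reference to a named type.
        return schema in open_names
    if isinstance(schema, list):
        # A union is recursive if any branch is.
        return any(is_recursive(branch, open_names) for branch in schema)
    if isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            inner = open_names | {schema["name"]}
            return any(is_recursive(field["type"], inner)
                       for field in schema.get("fields", []))
        if kind == "array":
            return is_recursive(schema["items"], open_names)
        if kind == "map":
            return is_recursive(schema["values"], open_names)
        # Covers wrappers such as {"type": "string"} or {"type": {...}}.
        return is_recursive(kind, open_names)
    return False


tree_node = {
    "type": "record",
    "name": "TreeNode",
    "fields": [
        {"name": "value", "type": "int"},
        {"name": "children", "type": {"type": "array", "items": "TreeNode"}},
    ],
}
print(is_recursive(tree_node))  # True
```

Reusing a previously defined record by name in a sibling field is not flagged, because that record's definition is already closed when the reference appears.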

## Next Steps

* [Deploy Lakehouse Table](/cloud/lakehouse/deploy-lakehouse-table) -- Set up the infrastructure with `private-cloud.yaml`
* [Prepare Lakehouse Catalogs](/cloud/lakehouse/prepare-lakehouse-catalogs) -- Set up your external catalog service
* [Dynamic Configuration Guide](/cloud/lakehouse/dynamic-configuration) -- Cluster-name prefix, override priority, and the full set of dynamic configuration keys
* [Register Lakehouse Catalogs](/cloud/lakehouse/catalogs/register-catalog) -- Connect catalogs to the compaction service
* [Enable Lakehouse Integration](/cloud/lakehouse/enable-lakehouse-integration) -- Enable at cluster, namespace, or topic level
* Features:
  * [Schema Evolution](/cloud/lakehouse/features/schema-evolution)
  * [Variant Type](/cloud/lakehouse/features/variant-type)
  * [Partition Key](/cloud/lakehouse/features/iceberg-partition-key)
  * [Upsert](/cloud/lakehouse/features/iceberg-upsert)
  * [Persist Key](/cloud/lakehouse/features/persist-key)
  * [Persist Extra Metadata](/cloud/lakehouse/features/persist-extra-metadata)
* [Observability](/cloud/lakehouse/lakehouse-observability) -- Metrics, alerts, and Grafana dashboard