Architecture
WAL Storage Options
Lakehouse Table supports two WAL storage tiers:
- Latency-optimized (Apache BookKeeper): Low-latency writes for performance-sensitive workloads
- Cost-optimized (Object Storage): Direct writes to AWS S3, GCS, or Azure Blob Storage for cost efficiency
Coordination
Oxia serves as the metadata store for coordination, leader election, schema storage, and offset index management. The Compaction Service operates with a leader-worker architecture: the leader publishes compaction tasks and commits results to the lakehouse catalog, while workers perform the WAL-to-Parquet conversion.
Table Modes
External Table (SDT — Stream Delivered Table)
An External Table delivers data from Pulsar topics into an external lakehouse catalog (such as Databricks Unity Catalog, Snowflake, AWS S3Table, or Google BigLake). A separate copy of the topic data is written to the Lakehouse table, so the Pulsar topic and the Lakehouse table hold independent copies of the same records. In this mode:
- Data is written to Iceberg or Delta Lake tables managed by the external catalog as a separate copy from the Pulsar topic
- Analytical access via standard table APIs (Spark, Trino, DuckDB, Athena, etc.)
- Supports upsert, partition key, and schema evolution
- The external catalog governs data lifecycle (retention, deletion) for the Lakehouse copy independently of the Pulsar topic
- Streaming reads with offset semantics are not supported on the delivered data
Internal Table (SBT — Stream Backed Table)
Coming Soon: Internal Table support is under active development.
An Internal Table is managed entirely by Ursa Storage. The Pulsar topic and the Lakehouse table share the same single copy of data; there is no separate write to the Lakehouse table. The same physical data supports both streaming reads (with offset tracking and replay) and analytical queries: true stream-table duality with zero data duplication.
Supported Cluster Profiles and Protocols
StreamNative Private Cloud offers two cluster types (Pulsar and Kafka), each with two performance profiles (latency-optimized and cost-optimized). Lakehouse delivery support depends on the combination of cluster type, profile, and producer protocol.
| Cluster Type | Profile | Producer Protocol | Lakehouse Delivery |
|---|---|---|---|
| Pulsar | Latency-optimized | Pulsar | Supported |
| Pulsar | Latency-optimized | Kafka | Coming Soon |
| Pulsar | Cost-optimized | Kafka | Supported |
| Pulsar | Cost-optimized | Pulsar | Not yet supported |
| Kafka | Cost-optimized | Kafka | Supported |
| Kafka | Latency-optimized | Kafka | Coming Soon |
- A Pulsar latency-optimized cluster uses Apache BookKeeper as the WAL tier. Topic data produced via the Pulsar protocol can be delivered to Lakehouse today; Kafka-protocol delivery is on the roadmap.
- A Pulsar cost-optimized cluster uses object storage as the WAL tier. Topic data produced via the Kafka protocol is delivered to Lakehouse; the Pulsar protocol is not yet supported on this profile.
- A Kafka cost-optimized cluster delivers Kafka topic data to Lakehouse today.
- A Kafka latency-optimized cluster will support Lakehouse delivery in a future release.
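
For a deployment pre-check, the support matrix above can be transcribed into a small lookup table. This is only a sketch: the dictionary and the function delivery_status are hypothetical helpers, not part of any StreamNative API, and the statuses are copied from the table as published here.

```python
# Support matrix transcribed from the table above.
# Keys: (cluster type, profile, producer protocol), all lowercase.
LAKEHOUSE_DELIVERY = {
    ("pulsar", "latency-optimized", "pulsar"): "Supported",
    ("pulsar", "latency-optimized", "kafka"): "Coming Soon",
    ("pulsar", "cost-optimized", "kafka"): "Supported",
    ("pulsar", "cost-optimized", "pulsar"): "Not yet supported",
    ("kafka", "cost-optimized", "kafka"): "Supported",
    ("kafka", "latency-optimized", "kafka"): "Coming Soon",
}


def delivery_status(cluster_type: str, profile: str, protocol: str) -> str:
    """Return the documented status, or "Not listed" for unknown combinations."""
    key = (cluster_type.lower(), profile.lower(), protocol.lower())
    return LAKEHOUSE_DELIVERY.get(key, "Not listed")


if __name__ == "__main__":
    print(delivery_status("Pulsar", "Cost-optimized", "Kafka"))
```

Combinations that do not appear in the table (for example, a Kafka cluster with the Pulsar producer protocol) return "Not listed" rather than a guessed status.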
Supported Formats
| Format | Status |
|---|---|
| Apache Iceberg | Supported |
| Delta Lake | Supported |
Supported Cloud Storage
| Provider | WAL Storage | Lakehouse Table |
|---|---|---|
| AWS S3 | Supported | Supported |
| Google Cloud Storage | Supported | Supported |
| Azure Blob Storage | Supported | Supported |
Supported Catalogs
Catalog support varies by cloud provider:
| Catalog | Table Format | AWS | GCP | Azure |
|---|---|---|---|---|
| Databricks Unity Catalog (Managed Iceberg) | Iceberg | Supported | Supported | Supported |
| Databricks Unity Catalog (Delta Lake) | Delta Lake | Supported | Supported | Supported |
| Snowflake Horizon Catalog | Iceberg | Supported | Supported | Supported |
| Snowflake Open Catalog (Polaris) | Iceberg | Supported | Supported | Supported |
| AWS S3Table | Iceberg | Supported | — | — |
| Google BigLake | Iceberg | — | Supported | — |
Topic to lakehouse identifier mapping
Each Pulsar topic maps to exactly one Lakehouse table. The mapping is 1:1 regardless of how many partitions the topic has — data from all partitions of a partitioned topic is consolidated into a single Lakehouse table. When data is delivered from a Pulsar topic to a lakehouse table, the topic’s tenant, namespace, and topic name are mapped to a catalog namespace and a table name. The mapping rules differ by catalog type because Pulsar allows characters (/, ., -, :) that are not valid in many catalog identifiers.
The compaction service applies the following rules.
Iceberg with hierarchical catalogs (Snowflake Open Catalog, Snowflake Horizon, Iceberg REST, Iceberg Hadoop)
The original Pulsar identifiers are used unchanged:
- Catalog namespace: <tenant>.<namespace> (two-level)
- Table name: the topic local name (the part after the namespace)
For example, persistent://my-tenant/my-namespace/orders is mapped to namespace my-tenant.my-namespace and table orders.
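
The hierarchical rule can be sketched in a few lines of Python. The function name hierarchical_identifier is hypothetical and is not the compaction service's code; the sketch assumes topic names of the form <domain>://<tenant>/<namespace>/<topic>.

```python
def hierarchical_identifier(topic: str) -> tuple[str, str]:
    """Map a Pulsar topic to (catalog namespace, table name) without escaping."""
    # Drop the "persistent://" (or similar) prefix if present.
    path = topic.split("://", 1)[-1]
    # Split into tenant, namespace, and the topic local name.
    tenant, namespace, local_name = path.split("/", 2)
    # Two-level namespace: <tenant>.<namespace>; table is the local name.
    return f"{tenant}.{namespace}", local_name


if __name__ == "__main__":
    print(hierarchical_identifier("persistent://my-tenant/my-namespace/orders"))
```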
Iceberg with flat-namespace catalogs (AWS S3Tables, Google BigLake, Hive)
These catalogs only accept a single-level namespace, so the tenant and namespace are flattened into one identifier with a cluster-name prefix. Each component is escaped to remove invalid characters:
| Source character | Replacement |
|---|---|
| / | ___ (three underscores) |
| . | _ (one underscore) |
| - | __ (two underscores) |
| : | ____ (four underscores) |
- Catalog namespace: <cluster>_<formatted-tenant>_<formatted-namespace> (default cluster prefix is pulsar; configurable via the cluster property in the compaction service)
- Table name (S3Tables): the topic local name with the same character escapes applied
- Table name (BigLake, Hive): the topic local name as-is
For example, with cluster prefix pulsar and topic persistent://public-v1/default.v2/test-table-v1:
- On AWS S3Tables: namespace pulsar_public__v1_default_v2, table test__table__v1
- On Google BigLake: namespace pulsar_public__v1_default_v2, table test-table-v1
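
The flat-namespace rules above can be reproduced with a single-pass character translation. This is a minimal sketch, not the compaction service's implementation: the names FLAT_ESCAPES and flat_identifier are hypothetical, the catalog argument values ("s3tables", "biglake", "hive") are illustrative labels, and the cluster prefix is assumed to be used as-is without escaping.

```python
# Escape table for flat-namespace catalogs, taken from the table above.
FLAT_ESCAPES = str.maketrans({"/": "___", ".": "_", "-": "__", ":": "____"})


def flat_identifier(topic: str, catalog: str, cluster: str = "pulsar") -> tuple[str, str]:
    """Map a Pulsar topic to (catalog namespace, table name) for a flat catalog."""
    path = topic.split("://", 1)[-1]
    tenant, namespace, local_name = path.split("/", 2)
    # Namespace: <cluster>_<formatted-tenant>_<formatted-namespace>.
    catalog_namespace = "_".join(
        [cluster, tenant.translate(FLAT_ESCAPES), namespace.translate(FLAT_ESCAPES)]
    )
    # Only S3Tables escapes the table name; BigLake and Hive keep it as-is.
    if catalog == "s3tables":
        table = local_name.translate(FLAT_ESCAPES)
    else:
        table = local_name
    return catalog_namespace, table


if __name__ == "__main__":
    topic = "persistent://public-v1/default.v2/test-table-v1"
    print(flat_identifier(topic, "s3tables"))
    print(flat_identifier(topic, "biglake"))
```

Using str.translate applies all replacements in one pass, so an underscore produced by one escape is never re-processed by another.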
Databricks Unity Catalog (Iceberg or Delta Lake)
Unity Catalog uses a three-level identifier (catalog.schema.table). The compaction service writes all topics into a single schema and encodes the full Pulsar topic path into the table name, so each catalog table maps 1:1 to a Pulsar topic. The full topic path <tenant>/<namespace>/<topic> is flattened with these escapes:
| Source character | Replacement |
|---|---|
| / | __ (two underscores) |
| . | ____ (four underscores) |
| - | ___ (three underscores) |
persistent://public/default/test-topic is mapped to table name public__default__test___topic. Topic persistent://public/default/v1.events is mapped to public__default__v1____events.
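
The Unity Catalog table-name encoding can be sketched the same way. The names UNITY_ESCAPES and unity_table_name are hypothetical; the sketch covers only the table name, since the target catalog and schema come from the compaction service configuration.

```python
# Escape table for Databricks Unity Catalog, taken from the table above.
UNITY_ESCAPES = str.maketrans({"/": "__", ".": "____", "-": "___"})


def unity_table_name(topic: str) -> str:
    """Flatten <tenant>/<namespace>/<topic> into a single Unity Catalog table name."""
    # Drop the "persistent://" (or similar) prefix, keep the full topic path.
    path = topic.split("://", 1)[-1]
    return path.translate(UNITY_ESCAPES)


if __name__ == "__main__":
    print(unity_table_name("persistent://public/default/test-topic"))
    print(unity_table_name("persistent://public/default/v1.events"))
```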
Limitations
Schema limitations
The following topic schema constructs are not supported when delivering data to a Lakehouse Table:
- Recursive schemas: Schemas that reference themselves, directly or indirectly, are not supported. An example is a record with a field whose type is the same record, such as a tree node that holds a list of child nodes of the same type. Iceberg and Delta Lake require a fixed, finite column structure, which cannot represent a self-referential schema. Topics that use a recursive schema cannot be delivered to a Lakehouse Table.
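
To check a topic schema before enabling delivery, a recursion test along the following lines may help. This is a sketch under stated assumptions: it assumes the topic uses an Avro schema already parsed from JSON into Python dicts and lists, it compares record names as written (Avro namespaces are ignored), and the function is_recursive is a hypothetical helper, not part of Pulsar or the compaction service.

```python
def is_recursive(schema, open_names=frozenset()):
    """Return True if the schema refers to a record that is still being defined."""
    if isinstance(schema, str):
        # A bare name is either a primitive or a reference to a named type.
        return schema in open_names
    if isinstance(schema, list):
        # Union: recursive if any branch is recursive.
        return any(is_recursive(branch, open_names) for branch in schema)
    if isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            # The record's own name is "open" while its fields are inspected.
            names = open_names | {schema["name"]}
            return any(is_recursive(f["type"], names) for f in schema.get("fields", []))
        if kind == "array":
            return is_recursive(schema["items"], open_names)
        if kind == "map":
            return is_recursive(schema["values"], open_names)
        # Other dict forms, e.g. {"type": "long", "logicalType": "..."}.
        return is_recursive(kind, open_names) if kind is not None else False
    return False


if __name__ == "__main__":
    tree_node = {
        "type": "record",
        "name": "TreeNode",
        "fields": [
            {"name": "value", "type": "int"},
            {"name": "children", "type": {"type": "array", "items": "TreeNode"}},
        ],
    }
    print(is_recursive(tree_node))
```

The check follows the current path of enclosing records, so it catches both direct references (TreeNode containing TreeNode) and indirect ones (A containing B, which contains A).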
Next Steps
- Deploy Lakehouse Table: Set up the infrastructure with private-cloud.yaml
- Prepare Lakehouse Catalogs: Set up your external catalog service
- Dynamic Configuration Guide: Cluster-name prefix, override priority, and the full set of dynamic configuration keys
- Configure Lakehouse Catalogs: Connect catalogs to the compaction service
- Enable Lakehouse Integration: Enable at cluster, namespace, or topic level
- Features:
  - Observability: Metrics, alerts, and Grafana dashboard