compactionScheduler.config.custom section of the PulsarBroker YAML.
Multi-Catalog Architecture
StreamNative supports configuring multiple catalogs simultaneously. Different namespaces or topics can route data to different catalogs.
Configuration Pattern
- Iceberg catalogs: iceberg.catalog.<catalog-name>.<property>
- Delta catalogs: delta.catalog.<catalog-name>.<property>
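As a minimal sketch of the key pattern above, the following Python helper flattens per-catalog settings into iceberg.catalog.<catalog-name>.<property> or delta.catalog.<catalog-name>.<property> keys. The helper name, the catalog names (polaris_main, s3table_main), and the placeholder values are hypothetical; only the key layout and the property names come from this page.

```python
def flatten_catalog_properties(table_format, catalogs):
    """Render {catalog: {property: value}} as '<format>.catalog.<catalog>.<property>' keys."""
    if table_format not in ("iceberg", "delta"):
        raise ValueError(f"unsupported table format: {table_format}")
    flat = {}
    for catalog_name, properties in catalogs.items():
        for prop, value in properties.items():
            flat[f"{table_format}.catalog.{catalog_name}.{prop}"] = value
    return flat


# Hypothetical catalog names and placeholder values, for illustration only.
iceberg_catalogs = {
    "polaris_main": {
        "catalog-backend": "POLARIS",
        "warehouse": "<polaris-catalog-name>",
        "scope": "PRINCIPAL_ROLE:ALL",
    },
    "s3table_main": {
        "catalog-backend": "S3TABLE",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
    },
}

flat_properties = flatten_catalog_properties("iceberg", iceberg_catalogs)
for key, value in flat_properties.items():
    print(f"{key}: {value}")
```

Because the catalog name is part of every key, two catalogs can carry the same property (for example catalog-backend) without colliding.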
Default Catalog
Set the default catalog used when no topic or namespace override is specified:
Catalog Resolution Priority
The catalog used for a topic is resolved in this order:
Iceberg Catalogs
Unity Catalog (Managed Iceberg Table)
| Property | Description |
|---|---|
| catalog-backend | UNITYCATALOG |
| type | rest |
| uri | Databricks workspace URL |
| warehouse | Catalog name created in Databricks |
| credential | Databricks access token |
| oauth2-server-uri | Databricks OAuth2 service URI |
| scope | all-apis |
| security | OAUTH2 |
| vended-credentials-enabled | true |
| token-refresh-enabled | true |
Snowflake Horizon Catalog
| Property | Description |
|---|---|
| catalog-backend | HORIZON |
| uri | Snowflake Horizon REST API endpoint |
| credential | PAT token |
| scope | Snowflake role scope (e.g., session:role:PUBLIC) |
| warehouse | Snowflake database name |
| header.X-Iceberg-Access-Delegation | vended-credentials (required) |
| token-refresh-enabled | true (recommended) |
Snowflake Open Catalog (Polaris)
| Property | Description |
|---|---|
| catalog-backend | POLARIS |
| credential | Client ID and secret in <id>:<secret> format |
| warehouse | Polaris catalog name |
| header.X-Iceberg-Access-Delegation | vended-credentials |
| scope | PRINCIPAL_ROLE:ALL |
| token-refresh-enabled | true |
AWS S3Table
| Property | Description |
|---|---|
| catalog-backend | S3TABLE |
| rest.sigv4-enabled | true (required for AWS SigV4 auth) |
| rest.signing-name | s3tables |
| rest.signing-region | AWS region of the S3Table bucket |
| uri | S3Tables REST endpoint (varies by region) |
| warehouse | S3Table bucket ARN |
| rest-metrics-reporting-enabled | false (S3Table does not support metric reporting) |
Important: The Ursa cluster must run in the same region as the S3Table bucket.
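The same-region requirement can be checked before deployment. The sketch below is a hedged illustration: it assumes the warehouse value follows the standard ARN layout (arn:partition:service:region:account:resource) with the s3tables service name, and the function names, account ID, and bucket name are hypothetical.

```python
def s3table_bucket_region(bucket_arn):
    """Extract the region field from an S3 Tables bucket ARN."""
    parts = bucket_arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3tables":
        raise ValueError(f"not an S3 Tables ARN: {bucket_arn}")
    return parts[3]


def check_same_region(cluster_region, properties):
    """Return a list of problems; an empty list means the regions line up."""
    problems = []
    bucket_region = s3table_bucket_region(properties["warehouse"])
    if bucket_region != cluster_region:
        problems.append(
            f"cluster runs in {cluster_region} but the bucket is in {bucket_region}"
        )
    signing_region = properties.get("rest.signing-region")
    if signing_region != bucket_region:
        problems.append(
            f"rest.signing-region is {signing_region} but the bucket is in {bucket_region}"
        )
    return problems


# Hypothetical account ID and bucket name, for illustration only.
s3table_properties = {
    "warehouse": "arn:aws:s3tables:us-east-1:111122223333:bucket/ursa-tables",
    "rest.signing-region": "us-east-1",
}

print(check_same_region("us-east-1", s3table_properties))
print(check_same_region("eu-west-1", s3table_properties))
```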
Google BigLake
| Property | Description |
|---|---|
| catalog-backend | BIGLAKE |
| warehouse | GCS bucket path from BigLake catalog properties |
| header.x-goog-user-project | GCP project ID from BigLake catalog properties |
| rest.auth.type | org.apache.iceberg.gcp.auth.GoogleAuthManager (fixed) |
| io-impl | org.apache.iceberg.gcp.gcs.GCSFileIO (fixed) |
| header.X-Iceberg-Access-Delegation | vended-credentials (fixed) |
Delta Lake Catalogs
Unity Catalog (Delta)
Authentication Options
Token-based (recommended):
BYOL (Bring Your Own Lakehouse)
Enable managed commit support for Unity Catalog:
Without Catalog (Direct Bucket)
If you do not need an external catalog service, data can be written directly to the object storage bucket.
Required permissions: When no external catalog is used, the compaction-scheduler pod’s IAM role (AWS), service account (GCP), or workload identity (Azure) must have read, write, create, and list permissions on the target bucket. Without an external catalog, the compaction service interacts with object storage directly to create namespaces, write metadata, list existing files, and read prior snapshots. Examples of the required permissions per cloud:
| Cloud | Permissions |
|---|---|
| AWS S3 | s3:GetObject, s3:PutObject, s3:DeleteObject, s3:ListBucket, s3:GetBucketLocation on the warehouse bucket and prefix |
| GCS | storage.buckets.get, storage.objects.get, storage.objects.list, storage.objects.create, storage.objects.delete (or the Storage Object Admin role) |
| Azure Blob / ADLS | Storage Blob Data Contributor on the container |
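For the AWS row, one way to express those actions is an IAM policy document. The sketch below is an assumption about layout, not a policy prescribed by this page: it places the object-level actions on the warehouse prefix and the bucket-level actions (s3:ListBucket, s3:GetBucketLocation) on the bucket itself, which is how IAM scopes those actions. The function name, bucket name, and prefix are hypothetical.

```python
import json

OBJECT_ACTIONS = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
BUCKET_ACTIONS = ["s3:ListBucket", "s3:GetBucketLocation"]


def build_s3_policy(bucket, prefix=""):
    """Build an IAM policy document covering the warehouse bucket and prefix."""
    prefix = prefix.strip("/")
    object_arn = (
        f"arn:aws:s3:::{bucket}/{prefix}/*" if prefix else f"arn:aws:s3:::{bucket}/*"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WarehouseObjects",
                "Effect": "Allow",
                "Action": OBJECT_ACTIONS,
                "Resource": object_arn,
            },
            {
                "Sid": "WarehouseBucket",
                "Effect": "Allow",
                "Action": BUCKET_ACTIONS,
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


# Hypothetical bucket and prefix, for illustration only.
policy = build_s3_policy("my-warehouse-bucket", "ursa/warehouse")
print(json.dumps(policy, indent=2))
```

The resulting document would be attached to the IAM role used by the compaction-scheduler pod.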
Iceberg (Hadoop Catalog)
The default Hadoop catalog writes Iceberg metadata and data files directly to the configured storage path. No external catalog service is required.
Table maintenance
Snowflake Open Catalog (Polaris) and the Hadoop catalog do not run table maintenance on your behalf. Streaming writes from the StreamNative Ursa compaction service produce many small Parquet files and accumulate snapshot history over time, which degrades query performance and inflates storage costs. You are responsible for scheduling and running maintenance against every Iceberg table written by Ursa.
Run the maintenance operations below on a regular schedule. They are provided as Apache Iceberg Spark stored procedures and can be triggered from any Spark cluster (Databricks, AWS EMR, AWS Glue, GCP Dataproc, or self-managed Spark) that has the Iceberg Spark runtime, catalog credentials, and IAM access to the warehouse bucket.
Maintenance operations
| Operation | Purpose | Suggested cadence |
|---|---|---|
| rewrite_data_files | Compact small Parquet files into fewer, larger files. Reduces file-listing overhead and improves scan performance. | Hourly to daily, depending on ingestion rate |
| expire_snapshots | Drop snapshots older than the retention window so their data and manifest files can be cleaned up. | Daily; retain at least 1–7 days so in-flight readers and time-travel queries keep working |
| remove_orphan_files | Delete files in the table location that are no longer referenced by any snapshot (typically left behind by failed or partial writes). | Weekly |
| rewrite_manifests | Rewrite manifest files so they align with the current partition layout. Improves query planning time. | Weekly, or after large schema or partition changes |
<catalog>. Replace <catalog>, <namespace>, and <table> with your values.
- Credentials. The principal that runs maintenance must have catalog privileges to read and write the target table (for example, the same TABLE_READ_DATA, TABLE_WRITE_DATA, TABLE_READ_PROPERTIES, and TABLE_WRITE_PROPERTIES privileges configured for the Ursa compaction service) and IAM access to the warehouse bucket so it can read and rewrite the underlying data files. With the Hadoop catalog there is no catalog service to authenticate against; only the bucket IAM access is required.
- Concurrency. Iceberg uses optimistic concurrency control. If maintenance commits race with the Ursa compaction writer, one of them retries. Schedule heavy operations (rewrite_data_files, rewrite_manifests) during low-write windows when possible.
- Retention vs. time travel. expire_snapshots and remove_orphan_files permanently delete files. Choose a retention window that exceeds the longest expected read query and your time-travel SLA.
- Schedule the workload. Most teams orchestrate these procedures from Databricks Jobs, AWS EMR steps, Airflow, Dagster, or a Kubernetes CronJob. Pick a scheduler that fits your existing operational stack.
- Reference. See the Iceberg Spark procedures documentation for the full parameter list, including options for partial rewrites (where), file-size targets, and merge-on-read delete file compaction.
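The four Iceberg procedures in the table can be rendered as Spark SQL CALL statements. The following is a minimal sketch, not the page's original example: the function name, the catalog, namespace, and table values, and the 7-day retention default are assumptions, and the cutoff timestamp is computed in UTC on the assumption that the Spark session time zone is also UTC.

```python
from datetime import datetime, timedelta, timezone


def iceberg_maintenance_sql(catalog, namespace, table, retention_days=7, now=None):
    """Return one Spark SQL CALL statement per Iceberg maintenance procedure."""
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    now = now or datetime.now(timezone.utc)
    # Snapshots older than this cutoff become eligible for expiry.
    cutoff = (now - timedelta(days=retention_days)).strftime("%Y-%m-%d %H:%M:%S")
    target = f"{namespace}.{table}"
    return {
        "rewrite_data_files": (
            f"CALL {catalog}.system.rewrite_data_files(table => '{target}')"
        ),
        "expire_snapshots": (
            f"CALL {catalog}.system.expire_snapshots("
            f"table => '{target}', older_than => TIMESTAMP '{cutoff}')"
        ),
        "remove_orphan_files": (
            f"CALL {catalog}.system.remove_orphan_files(table => '{target}')"
        ),
        "rewrite_manifests": (
            f"CALL {catalog}.system.rewrite_manifests(table => '{target}')"
        ),
    }


# Hypothetical names and a fixed clock so the rendered output is reproducible.
statements = iceberg_maintenance_sql(
    "my_catalog", "analytics", "events",
    retention_days=7,
    now=datetime(2024, 1, 8, tzinfo=timezone.utc),
)
for name, sql in statements.items():
    print(f"-- {name}\n{sql};")
```

Each statement can be passed to spark.sql(...) in a session that has the Iceberg Spark runtime and SQL extensions enabled. Keeping the statements in a dictionary lets a scheduler run each operation on its own cadence.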
Delta (No Unity Catalog)
Delta tables are written directly to the configured storage path without Unity Catalog integration.
Table maintenance
When Delta tables are written directly to object storage without a managed catalog, no service runs maintenance on your behalf. Streaming writes from the StreamNative Ursa compaction service produce many small Parquet files and accumulate Delta transaction-log history over time, which degrades query performance and inflates storage costs. You are responsible for scheduling and running maintenance against every Delta table written by Ursa.
Run the maintenance operations below on a regular schedule. They can be executed from any Spark cluster (Databricks, AWS EMR, AWS Glue, GCP Dataproc, or self-managed Spark) that has the delta-spark runtime and IAM access to the warehouse bucket. For background and tuning recommendations, see the Databricks Delta Lake best practices guide.
Maintenance operations
| Operation | Purpose | Suggested cadence |
|---|---|---|
| OPTIMIZE | Compact small Parquet files into fewer, larger files (bin-packing). Reduces file-listing overhead and improves scan performance. | Hourly to daily, depending on ingestion rate |
| OPTIMIZE … ZORDER BY | Co-locate data by frequently filtered columns so query engines can skip more files. | Weekly, or after large inserts |
| VACUUM | Delete data files that are no longer referenced by the Delta log and are older than the retention threshold. | Daily or weekly; retain at least 7 days so in-flight readers and time-travel queries keep working |
<path> with the table location (s3://..., gs://..., or abfss://...).
- Credentials. The principal that runs maintenance must have IAM read, write, list, and delete permissions on the warehouse bucket so it can rewrite and remove data files.
- VACUUM retention. Do not reduce the retention window below 7 days without explicitly disabling the safety check (spark.databricks.delta.retentionDurationCheck.enabled = false). Shorter windows risk breaking concurrent readers and time-travel queries.
- Concurrency. Delta uses optimistic concurrency control. Bin-packing OPTIMIZE runs do not generally conflict with append-only writes from Ursa, but OPTIMIZE … ZORDER BY rewrites larger portions of the table and is best scheduled in lower-write windows.
- Schedule the workload. Most teams orchestrate maintenance from Databricks Jobs, AWS EMR steps, Airflow, Dagster, or a Kubernetes CronJob. Pick a scheduler that fits your existing operational stack.
- Reference. See the Databricks Delta best practices and the Delta Lake utility commands for full syntax, retention options, and tuning guidance.
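The Delta operations in the table can be rendered as Spark SQL against a path-based table. This is a hedged sketch rather than the page's original example: the function name, the bucket path, and the Z-order column are hypothetical, and the helper refuses retention below 168 hours (7 days) instead of disabling the safety check.

```python
MIN_RETENTION_HOURS = 168  # 7 days, the retention floor described above


def delta_maintenance_sql(path, zorder_columns=(), retention_hours=MIN_RETENTION_HOURS):
    """Return Spark SQL statements for OPTIMIZE, optional ZORDER BY, and VACUUM."""
    if retention_hours < MIN_RETENTION_HOURS:
        raise ValueError(
            f"retention_hours must be at least {MIN_RETENTION_HOURS} (7 days)"
        )
    target = f"delta.`{path}`"  # path-based table reference
    statements = {"optimize": f"OPTIMIZE {target}"}
    if zorder_columns:
        columns = ", ".join(zorder_columns)
        statements["optimize_zorder"] = f"OPTIMIZE {target} ZORDER BY ({columns})"
    statements["vacuum"] = f"VACUUM {target} RETAIN {retention_hours} HOURS"
    return statements


# Hypothetical table location and column, for illustration only.
delta_statements = delta_maintenance_sql(
    "s3://my-warehouse-bucket/ursa/events",
    zorder_columns=("event_type",),
)
for name, sql in delta_statements.items():
    print(f"-- {name}\n{sql};")
```

Each statement can be passed to spark.sql(...) in a session that has the delta-spark runtime configured. The statements are keyed by operation so the bin-packing OPTIMIZE, the Z-order pass, and VACUUM can run on separate schedules.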
Multi-Catalog Example
Configure two catalogs (one Polaris, one S3Table) and set a default:
Limitations
- A namespace or topic can reference only one catalog at a time
- You can assign different catalogs to different topics or namespaces
- You cannot assign multiple catalogs to a single topic or namespace
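The one-catalog-per-topic rule can be illustrated with a small lookup. The sketch below assumes, and this is an assumption rather than something this page spells out, that a topic-level override wins over a namespace-level override, which wins over the default catalog. The function name, the short tenant/namespace/topic naming form, and all catalog names are hypothetical.

```python
def resolve_catalog(topic, topic_catalogs, namespace_catalogs, default_catalog):
    """Pick exactly one catalog for a topic.

    Assumed order: topic override, then namespace override, then the default.
    """
    if topic in topic_catalogs:
        return topic_catalogs[topic]
    # Assumes the short 'tenant/namespace/topic' form of a topic name.
    namespace = topic.rsplit("/", 1)[0]
    if namespace in namespace_catalogs:
        return namespace_catalogs[namespace]
    return default_catalog


# Each mapping holds a single catalog name per key, so a topic or namespace
# can never reference more than one catalog at a time.
topic_catalogs = {"public/orders/events": "s3table_main"}
namespace_catalogs = {"public/orders": "polaris_main"}

print(resolve_catalog("public/orders/events", topic_catalogs, namespace_catalogs, "polaris_default"))
print(resolve_catalog("public/orders/refunds", topic_catalogs, namespace_catalogs, "polaris_default"))
print(resolve_catalog("public/billing/invoices", topic_catalogs, namespace_catalogs, "polaris_default"))
```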
Next Steps
- Enable Lakehouse Integration — Enable SDT at cluster, namespace, or topic level