After preparing your external catalog, configure the compaction service to connect to it. Catalog configuration is added to the compactionScheduler.config.custom section of the PulsarBroker YAML.

Multi-Catalog Architecture

StreamNative supports configuring multiple catalogs simultaneously. Different namespaces or topics can route data to different catalogs.

Configuration Pattern

  • Iceberg catalogs: iceberg.catalog.<catalog-name>.<property>
  • Delta catalogs: delta.catalog.<catalog-name>.<property>
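
As a rough illustration of the key pattern above, the sketch below assembles property keys from their parts. The helper name `catalog_key` is hypothetical and not part of any StreamNative tooling; it only shows how the format prefix, catalog name, and property combine.

```python
def catalog_key(table_format: str, catalog_name: str, prop: str) -> str:
    """Build a key of the form <format>.catalog.<catalog-name>.<property>."""
    # Only the two prefixes described in this document are accepted.
    if table_format not in ("iceberg", "delta"):
        raise ValueError(f"unsupported table format: {table_format!r}")
    return f"{table_format}.catalog.{catalog_name}.{prop}"


print(catalog_key("iceberg", "polaris-prod", "type"))
print(catalog_key("delta", "unity-prod", "unityCatalogUri"))
```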

Default Catalog

Set the default catalog used when no topic or namespace override is specified:
custom:
  catalog.default: <catalog-name>

Catalog Resolution Priority

The catalog used for a topic is resolved in this order:
Topic property (catalog.name)
    ↓ (if not set)
Namespace property (catalog.name)
    ↓ (if not set)
Default catalog (catalog.default)
See Enable Lakehouse Integration for how to assign catalogs at namespace and topic level.
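
The resolution order can be sketched as a simple fallback chain. This is a minimal model of the documented priority, not the service's actual implementation: `resolve_catalog` is a hypothetical name, and treating an empty string as "not set" is an assumption.

```python
from typing import Mapping, Optional


def resolve_catalog(
    topic_props: Mapping[str, str],
    namespace_props: Mapping[str, str],
    default_catalog: Optional[str],
) -> Optional[str]:
    """Return the catalog for a topic: topic property, then namespace property, then default."""
    for props in (topic_props, namespace_props):
        name = props.get("catalog.name")
        if name:  # assumption: a missing or empty value means "not set"
            return name
    return default_catalog


# The topic override wins over the namespace setting and the default.
print(resolve_catalog({"catalog.name": "s3table-analytics"},
                      {"catalog.name": "polaris-prod"},
                      "polaris-prod"))
```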

Iceberg Catalogs

Unity Catalog (Managed Iceberg Table)

compactionScheduler:
  config:
    custom:
      catalog.default: <catalog-name>
      iceberg.catalog.<catalog-name>.catalog-backend: "UNITYCATALOG"
      iceberg.catalog.<catalog-name>.type: "rest"
      iceberg.catalog.<catalog-name>.uri: "https://<workspace-url>/api/2.1/unity-catalog/iceberg-rest"
      iceberg.catalog.<catalog-name>.warehouse: "<catalog-name-in-databricks>"
      iceberg.catalog.<catalog-name>.credential: "<access-token>"
      iceberg.catalog.<catalog-name>.oauth2-server-uri: "https://<workspace-url>/oidc/v1/token"
      iceberg.catalog.<catalog-name>.scope: "all-apis"
      iceberg.catalog.<catalog-name>.security: "OAUTH2"
      iceberg.catalog.<catalog-name>.vended-credentials-enabled: "true"
      iceberg.catalog.<catalog-name>.token-refresh-enabled: "true"
Property | Description
--- | ---
catalog-backend | UNITYCATALOG
type | rest
uri | Databricks workspace URL
warehouse | Catalog name created in Databricks
credential | Databricks access token
oauth2-server-uri | Databricks OAuth2 token endpoint URI
scope | all-apis
security | OAUTH2
vended-credentials-enabled | true
token-refresh-enabled | true

Snowflake Horizon Catalog

compactionScheduler:
  config:
    custom:
      catalog.default: <catalog-name>
      iceberg.catalog.<catalog-name>.catalog-backend: "HORIZON"
      iceberg.catalog.<catalog-name>.type: "rest"
      iceberg.catalog.<catalog-name>.uri: "https://<org>-<account>.snowflakecomputing.com/polaris/api/catalog"
      iceberg.catalog.<catalog-name>.credential: "<PAT-token>"
      iceberg.catalog.<catalog-name>.scope: "session:role:<role>"
      iceberg.catalog.<catalog-name>.warehouse: "<database-name>"
      iceberg.catalog.<catalog-name>.header.X-Iceberg-Access-Delegation: "vended-credentials"
      iceberg.catalog.<catalog-name>.token-refresh-enabled: "true"
Property | Description
--- | ---
catalog-backend | HORIZON
uri | Snowflake Horizon REST API endpoint
credential | PAT token
scope | Snowflake role scope (e.g., session:role:PUBLIC)
warehouse | Snowflake database name
header.X-Iceberg-Access-Delegation | vended-credentials (required)
token-refresh-enabled | true (recommended)

Snowflake Open Catalog (Polaris)

compactionScheduler:
  config:
    custom:
      catalog.default: <catalog-name>
      iceberg.catalog.<catalog-name>.catalog-backend: "POLARIS"
      iceberg.catalog.<catalog-name>.type: "rest"
      iceberg.catalog.<catalog-name>.uri: "https://<account>.snowflakecomputing.com/polaris/api/catalog"
      iceberg.catalog.<catalog-name>.credential: "<client-id>:<client-secret>"
      iceberg.catalog.<catalog-name>.warehouse: "<catalog-name>"
      iceberg.catalog.<catalog-name>.header.X-Iceberg-Access-Delegation: "vended-credentials"
      iceberg.catalog.<catalog-name>.scope: "PRINCIPAL_ROLE:ALL"
      iceberg.catalog.<catalog-name>.token-refresh-enabled: "true"
Property | Description
--- | ---
catalog-backend | POLARIS
credential | Client ID and secret in <id>:<secret> format
warehouse | Polaris catalog name
header.X-Iceberg-Access-Delegation | vended-credentials
scope | PRINCIPAL_ROLE:ALL
token-refresh-enabled | true

AWS S3Table

compactionScheduler:
  config:
    custom:
      catalog.default: <catalog-name>
      iceberg.catalog.<catalog-name>.catalog-backend: "S3TABLE"
      iceberg.catalog.<catalog-name>.type: "rest"
      iceberg.catalog.<catalog-name>.rest.sigv4-enabled: "true"
      iceberg.catalog.<catalog-name>.rest.signing-name: "s3tables"
      iceberg.catalog.<catalog-name>.rest.signing-region: "<region>"
      iceberg.catalog.<catalog-name>.uri: "https://s3tables.<region>.amazonaws.com/iceberg"
      iceberg.catalog.<catalog-name>.warehouse: "arn:aws:s3tables:<region>:<account>:bucket/<bucket-name>"
      iceberg.catalog.<catalog-name>.rest-metrics-reporting-enabled: "false"
Property | Description
--- | ---
catalog-backend | S3TABLE
rest.sigv4-enabled | true (required for AWS SigV4 auth)
rest.signing-name | s3tables
rest.signing-region | AWS region of the S3Table bucket
uri | S3Tables REST endpoint (varies by region)
warehouse | S3Table bucket ARN
rest-metrics-reporting-enabled | false (S3Table does not support metric reporting)
Important: The Ursa cluster must run in the same region as the S3Table bucket.
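
Because the region appears in three places (signing region, endpoint URI, and bucket ARN), it can help to generate the block from one set of inputs. The sketch below does this under the assumption that the keys follow the template above exactly; `s3table_properties` and the sample values are hypothetical.

```python
def s3table_properties(catalog_name: str, region: str, account_id: str, bucket: str) -> dict:
    """Build the S3Table catalog properties so the region is consistent everywhere."""
    prefix = f"iceberg.catalog.{catalog_name}"
    return {
        f"{prefix}.catalog-backend": "S3TABLE",
        f"{prefix}.type": "rest",
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": "s3tables",
        f"{prefix}.rest.signing-region": region,
        f"{prefix}.uri": f"https://s3tables.{region}.amazonaws.com/iceberg",
        f"{prefix}.warehouse": f"arn:aws:s3tables:{region}:{account_id}:bucket/{bucket}",
        f"{prefix}.rest-metrics-reporting-enabled": "false",
    }


# Print the properties as YAML-style lines for the custom section.
for key, value in s3table_properties("s3table-analytics", "us-east-2", "123456789012", "analytics").items():
    print(f'{key}: "{value}"')
```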

Google BigLake

compactionScheduler:
  config:
    custom:
      catalog.default: <catalog-name>
      iceberg.catalog.<catalog-name>.catalog-backend: "BIGLAKE"
      iceberg.catalog.<catalog-name>.type: "rest"
      iceberg.catalog.<catalog-name>.uri: "https://biglake.googleapis.com/iceberg/v1/restcatalog"
      iceberg.catalog.<catalog-name>.warehouse: "gs://<bucket-name>"
      iceberg.catalog.<catalog-name>.header.x-goog-user-project: "<gcp-project-id>"
      iceberg.catalog.<catalog-name>.rest.auth.type: "org.apache.iceberg.gcp.auth.GoogleAuthManager"
      iceberg.catalog.<catalog-name>.io-impl: "org.apache.iceberg.gcp.gcs.GCSFileIO"
      iceberg.catalog.<catalog-name>.rest-metrics-reporting-enabled: "false"
      iceberg.catalog.<catalog-name>.header.X-Iceberg-Access-Delegation: "vended-credentials"
Property | Description
--- | ---
catalog-backend | BIGLAKE
warehouse | GCS bucket path from BigLake catalog properties
header.x-goog-user-project | GCP project ID from BigLake catalog properties
rest.auth.type | org.apache.iceberg.gcp.auth.GoogleAuthManager (fixed)
io-impl | org.apache.iceberg.gcp.gcs.GCSFileIO (fixed)
header.X-Iceberg-Access-Delegation | vended-credentials (fixed)

Delta Lake Catalogs

Unity Catalog (Delta)

compactionScheduler:
  config:
    custom:
      catalog.default: <catalog-name>
      delta.catalog.<catalog-name>.unityCatalogUri: "https://<workspace-url>"
      delta.catalog.<catalog-name>.unityCatalogName: "<catalog-name-in-databricks>"
      delta.catalog.<catalog-name>.unityCatalogToken: "<access-token>"

Authentication Options

Token-based (recommended):
delta.catalog.<catalog-name>.unityCatalogToken: "<token>"
# OR from file:
delta.catalog.<catalog-name>.unityCatalogTokenFile: "/path/to/token/file"
OAuth2 (machine-to-machine):
delta.catalog.<catalog-name>.unityCatalogClientId: "<client-id>"
delta.catalog.<catalog-name>.unityCatalogClientSecret: "<client-secret>"

BYOL (Bring Your Own Lakehouse)

Enable managed commit support for Unity Catalog:
# Delta Lake
delta.catalog.<catalog-name>.unityCatalogByolEnabled: "true"

# Iceberg
iceberg.catalog.<catalog-name>.unityCatalogByolEnabled: "true"

Without Catalog (Direct Bucket)

If you do not need an external catalog service, data can be written directly to the object storage bucket.
Required permissions: When no external catalog is used, the compaction-scheduler pod’s IAM role (AWS), service account (GCP), or workload identity (Azure) must have read, write, create, list, and delete permissions on the target bucket. In this mode the compaction service interacts with object storage directly to create namespaces, write metadata, list existing files, and read prior snapshots. Examples of the required permissions per cloud:
Cloud | Permissions
--- | ---
AWS S3 | s3:GetObject, s3:PutObject, s3:DeleteObject, s3:ListBucket, s3:GetBucketLocation on the warehouse bucket and prefix
GCS | storage.buckets.get, storage.objects.get, storage.objects.list, storage.objects.create, storage.objects.delete (or the Storage Object Admin role)
Azure Blob / ADLS | Storage Blob Data Contributor on the container
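
For the AWS row, the permissions could be expressed as an IAM policy document along these lines. This is a sketch only: `warehouse_policy`, the bucket name, and the prefix are hypothetical, and your organization may require tighter scoping. It relies on the standard S3 rule that bucket-level actions target the bucket ARN while object-level actions target object ARNs.

```python
import json


def warehouse_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy granting the S3 actions listed above on a warehouse bucket/prefix."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    clean_prefix = prefix.strip("/")
    # Object ARN pattern: the whole bucket when no prefix is given, else keys under the prefix.
    objects_arn = f"{bucket_arn}/{clean_prefix}/*" if clean_prefix else f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,
            },
            {   # Object-level actions apply to object ARNs under the warehouse prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": objects_arn,
            },
        ],
    }


print(json.dumps(warehouse_policy("my-lakehouse-bucket", "warehouse"), indent=2))
```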

Iceberg (Hadoop Catalog)

The default Hadoop catalog writes Iceberg metadata and data files directly to the configured storage path. No external catalog service is required.
compactionScheduler:
  config:
    lakehouseType: iceberg
    catalog.default: <catalog-name>
    iceberg.catalog.<catalog-name>.type: "hadoop"
    iceberg.catalog.<catalog-name>.warehouse: "<bucket>/suffix"
    streamTableMode: "EXTERNAL"

Table maintenance

Snowflake Open Catalog (Polaris) and the Hadoop catalog do not run table maintenance on your behalf. Streaming writes from the StreamNative Ursa compaction service produce many small Parquet files and accumulate snapshot history over time, which degrades query performance and inflates storage costs. You are responsible for scheduling and running maintenance against every Iceberg table written by Ursa.

Run the maintenance operations below on a regular schedule. They are provided as Apache Iceberg Spark stored procedures and can be triggered from any Spark cluster (Databricks, AWS EMR, AWS Glue, GCP Dataproc, or self-managed Spark) that has the Iceberg Spark runtime, catalog credentials, and IAM access to the warehouse bucket.

Maintenance operations
Operation | Purpose | Suggested cadence
--- | --- | ---
rewrite_data_files | Compact small Parquet files into fewer, larger files. Reduces file-listing overhead and improves scan performance. | Hourly to daily, depending on ingestion rate
expire_snapshots | Drop snapshots older than the retention window so their data and manifest files can be cleaned up. | Daily; retain at least 1–7 days so in-flight readers and time-travel queries keep working
remove_orphan_files | Delete files in the table location that are no longer referenced by any snapshot (typically left behind by failed or partial writes). | Weekly
rewrite_manifests | Rewrite manifest files so they align with the current partition layout. Improves query planning time. | Weekly, or after large schema or partition changes
Example: run maintenance from Spark
The following examples assume the catalog has been registered in Spark as <catalog>. Replace <catalog>, <namespace>, and <table> with your values.
-- Compact small files toward the target file size (512 MB by default).
CALL <catalog>.system.rewrite_data_files(table => '<namespace>.<table>');

-- Expire snapshots older than the cutoff timestamp (here, 3 days before an
-- assumed run date of 2026-05-23); keep the 5 most recent snapshots.
CALL <catalog>.system.expire_snapshots(
  table       => '<namespace>.<table>',
  older_than  => TIMESTAMP '2026-05-20 00:00:00',
  retain_last => 5
);

-- Remove orphan files older than the cutoff timestamp (here, 7 days before an
-- assumed run date of 2026-05-23).
CALL <catalog>.system.remove_orphan_files(
  table      => '<namespace>.<table>',
  older_than => TIMESTAMP '2026-05-16 00:00:00'
);

-- Rewrite manifests to match the current partition layout.
CALL <catalog>.system.rewrite_manifests(table => '<namespace>.<table>');
Operational guidance
  • Credentials. The principal that runs maintenance must have catalog privileges to read and write the target table (for example, the same TABLE_READ_DATA, TABLE_WRITE_DATA, TABLE_READ_PROPERTIES, and TABLE_WRITE_PROPERTIES privileges configured for the Ursa compaction service) and IAM access to the warehouse bucket so it can read and rewrite the underlying data files. With the Hadoop catalog there is no catalog service to authenticate against — only the bucket IAM access is required.
  • Concurrency. Iceberg uses optimistic concurrency control. If maintenance commits race with the Ursa compaction writer, one of them retries. Schedule heavy operations (rewrite_data_files, rewrite_manifests) during low-write windows when possible.
  • Retention vs. time travel. expire_snapshots and remove_orphan_files permanently delete files. Choose a retention window that exceeds the longest expected read query and your time-travel SLA.
  • Schedule the workload. Most teams orchestrate these procedures from Databricks Jobs, AWS EMR steps, Airflow, Dagster, or a Kubernetes CronJob. Pick a scheduler that fits your existing operational stack.
  • Reference. See the Iceberg Spark procedures documentation for the full parameter list, including options for partial rewrites (where), file-size targets, and merge-on-read delete file compaction.
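
Since the procedures take an absolute cutoff timestamp, a scheduled job usually needs to compute it from the run time and the retention window. The sketch below renders the two retention-sensitive statements as strings; the function names are hypothetical, the 3-day and 7-day defaults mirror the cadence discussed above, and it assumes the Spark session time zone matches the clock used for `now`. The resulting strings would then be passed to your Spark session (for example via `spark.sql(...)`), which is not shown here.

```python
from datetime import datetime, timedelta


def cutoff(now: datetime, days: int) -> str:
    """Format (now - days) as the body of a Spark SQL TIMESTAMP literal."""
    return (now - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")


def expire_snapshots_sql(catalog: str, table: str, now: datetime,
                         days: int = 3, retain_last: int = 5) -> str:
    """Render an expire_snapshots call with a cutoff relative to `now`."""
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff(now, days)}', "
        f"retain_last => {retain_last})"
    )


def remove_orphan_files_sql(catalog: str, table: str, now: datetime, days: int = 7) -> str:
    """Render a remove_orphan_files call with a cutoff relative to `now`."""
    return (
        f"CALL {catalog}.system.remove_orphan_files("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff(now, days)}')"
    )


run_time = datetime(2026, 5, 23)  # in a real job this would be the current time
print(expire_snapshots_sql("my_catalog", "analytics.events", run_time))
print(remove_orphan_files_sql("my_catalog", "analytics.events", run_time))
```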

Delta (No Unity Catalog)

Delta tables are written directly to the configured storage path without Unity Catalog integration.
compactionScheduler:
  config:
    catalog.default: <catalog-name>
    lakehouseType: delta
    delta.catalog.<catalog-name>.directExternalStoragePath: "<bucket>/suffix"
    streamTableMode: "EXTERNAL"

Table maintenance

When Delta tables are written directly to object storage without a managed catalog, no service runs maintenance on your behalf. Streaming writes from the StreamNative Ursa compaction service produce many small Parquet files and accumulate Delta transaction-log history over time, which degrades query performance and inflates storage costs. You are responsible for scheduling and running maintenance against every Delta table written by Ursa.

Run the maintenance operations below on a regular schedule. They can be executed from any Spark cluster (Databricks, AWS EMR, AWS Glue, GCP Dataproc, or self-managed Spark) that has the delta-spark runtime and IAM access to the warehouse bucket. For background and tuning recommendations, see the Databricks Delta Lake best practices guide.

Maintenance operations
Operation | Purpose | Suggested cadence
--- | --- | ---
OPTIMIZE | Compact small Parquet files into fewer, larger files (bin-packing). Reduces file-listing overhead and improves scan performance. | Hourly to daily, depending on ingestion rate
OPTIMIZE … ZORDER BY | Co-locate data by frequently filtered columns so query engines can skip more files. | Weekly, or after large inserts
VACUUM | Delete data files that are no longer referenced by the Delta log and are older than the retention threshold. | Daily or weekly; retain at least 7 days so in-flight readers and time-travel queries keep working
Example: run maintenance from Spark
Replace <path> with the table location (s3://..., gs://..., or abfss://...).
-- Compact small files. Default target file size is 1 GB.
OPTIMIZE delta.`<path>`;

-- Compact and co-locate by frequently filtered columns.
OPTIMIZE delta.`<path>` ZORDER BY (user_id, event_time);

-- Remove data files older than 7 days that are not referenced by the log.
VACUUM delta.`<path>` RETAIN 168 HOURS;
Operational guidance
  • Credentials. The principal that runs maintenance must have IAM read, write, list, and delete permissions on the warehouse bucket so it can rewrite and remove data files.
  • VACUUM retention. By default, VACUUM rejects a retention window shorter than 7 days; going lower requires explicitly disabling the safety check (spark.databricks.delta.retentionDurationCheck.enabled = false). Avoid doing so, because shorter windows risk breaking concurrent readers and time-travel queries.
  • Concurrency. Delta uses optimistic concurrency control. Bin-packing OPTIMIZE runs do not generally conflict with append-only writes from Ursa, but OPTIMIZE … ZORDER BY rewrites larger portions of the table and is best scheduled in lower-write windows.
  • Schedule the workload. Most teams orchestrate maintenance from Databricks Jobs, AWS EMR steps, Airflow, Dagster, or a Kubernetes CronJob. Pick a scheduler that fits your existing operational stack.
  • Reference. See the Databricks Delta best practices and the Delta Lake utility commands for full syntax, retention options, and tuning guidance.
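
A scheduled job can enforce the 7-day floor before it ever submits a VACUUM. The sketch below renders the statement and refuses shorter windows; `vacuum_sql` and `MIN_RETENTION_HOURS` are hypothetical names, and the guard is an illustration of the guidance above rather than a replacement for Delta's own safety check.

```python
MIN_RETENTION_HOURS = 7 * 24  # the 7-day floor recommended above


def vacuum_sql(path: str, retain_days: int = 7) -> str:
    """Render a VACUUM statement, rejecting retention windows shorter than 7 days."""
    hours = retain_days * 24
    if hours < MIN_RETENTION_HOURS:
        raise ValueError(
            f"retention of {hours} hours is below the {MIN_RETENTION_HOURS}-hour minimum"
        )
    return f"VACUUM delta.`{path}` RETAIN {hours} HOURS"


print(vacuum_sql("s3://my-lakehouse-bucket/warehouse/events"))
```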

Multi-Catalog Example

Configure two catalogs (one Polaris, one S3Table) and set a default:
compactionScheduler:
  config:
    custom:
      # Default catalog
      catalog.default: polaris-prod

      # Catalog 1: Snowflake Open Catalog (Polaris)
      iceberg.catalog.polaris-prod.catalog-backend: "POLARIS"
      iceberg.catalog.polaris-prod.type: "rest"
      iceberg.catalog.polaris-prod.uri: "https://xyz.snowflakecomputing.com/polaris/api/catalog"
      iceberg.catalog.polaris-prod.credential: "<client-id>:<client-secret>"
      iceberg.catalog.polaris-prod.warehouse: "prod-catalog"

      # Catalog 2: AWS S3Table
      iceberg.catalog.s3table-analytics.catalog-backend: "S3TABLE"
      iceberg.catalog.s3table-analytics.type: "rest"
      iceberg.catalog.s3table-analytics.rest.sigv4-enabled: "true"
      iceberg.catalog.s3table-analytics.rest.signing-name: "s3tables"
      iceberg.catalog.s3table-analytics.rest.signing-region: "us-east-2"
      iceberg.catalog.s3table-analytics.uri: "https://s3tables.us-east-2.amazonaws.com/iceberg"
      iceberg.catalog.s3table-analytics.warehouse: "arn:aws:s3tables:us-east-2:123456789012:bucket/analytics"
      iceberg.catalog.s3table-analytics.rest-metrics-reporting-enabled: "false"

      # Configure SDT
      streamTableMode: "EXTERNAL"
      
      # Configure to use Iceberg
      lakehouseType: "ICEBERG"
Then assign catalogs per namespace or topic:
# Assign polaris-prod explicitly to all topics in the namespace (it is also the default)
pulsar-admin namespaces set-property -k catalog.name -v polaris-prod public/default

# Override for a specific topic to use S3Table
pulsar-admin topics update-properties \
  -p catalog.name=s3table-analytics \
  persistent://public/default/analytics-topic

Limitations

  • A namespace or topic can reference only one catalog at a time; you cannot assign multiple catalogs to a single topic or namespace
  • Different topics or namespaces can use different catalogs

Next Steps