
Integrate With Snowflake Open Catalog

Introduction

This guide offers a detailed walkthrough for integrating StreamNative Cloud with Snowflake Open Catalog. It covers essential aspects such as configuring authentication, storage buckets, catalogs, and other key components. Snowflake Open Catalog integration is available with StreamNative BYOC Ursa clusters, which can be deployed into your AWS or GCP cloud account; specific directions are included for each cloud provider where applicable. By following this guide, you will enable seamless interaction between StreamNative Cloud and Snowflake Open Catalog.

Setup Snowflake Open Catalog

Before initiating the integration of Snowflake Open Catalog with StreamNative Cloud, please ensure the following steps are completed. You can also watch this video to learn more about preparing Snowflake Open Catalog (AWS example).

Step 1: Create Snowflake AI Data Cloud Account

Create a Snowflake AI Data Cloud account. The homepage of a Snowflake AI Data Cloud account will look as follows.

Create Snowflake Standard Account

Step 2: Create Snowflake Open Catalog Account

To access the Snowflake Open Catalog console, a specialized Open Catalog account must be created. This account type is specifically designed for managing Open Catalog features and functionality.

Enter Admin → Accounts → Toggle → Create Snowflake Open Catalog Account

Create Snowflake Open Catalog Account

Configure the Snowflake Open Catalog

  • Cloud: the cloud provider used to deploy Snowflake Open Catalog; use the same cloud provider where the StreamNative BYOC Ursa cluster is deployed (e.g. AWS)
  • Region: the region for the Snowflake Open Catalog; use the same region as the StreamNative BYOC Ursa cluster

The Snowflake Open Catalog, storage bucket, and StreamNative BYOC Ursa cluster should be in the same cloud provider and region. Snowflake Open Catalog doesn’t support cross-region buckets. To avoid costs associated with cross-region traffic, we highly recommend that your storage bucket and StreamNative BYOC Ursa cluster are in the same region.

Edition: any

Create Snowflake Open Catalog Account

Next, input a Snowflake Open Catalog Account Name, User Name, Password, and Email. This creates a new user specifically for use with the Snowflake Open Catalog account.

Enter Snowflake Open Catalog Account Details

Click Create Account. You will see the following confirmation if account creation is successful. We highly recommend taking a screenshot of this confirmation message; the Account Locator URL will be used in later steps.

Create Snowflake Open Catalog Account

Click the Account URL, then sign in to your Open Catalog account. You will enter the Snowflake Open Catalog console.

Click Account URL

If you need the Account URL of your Snowflake Open Catalog Account in the future, navigate to Admin → Accounts → … → Manage URLs of your Snowflake Account.

Click Account URL

Step 3: Set up storage bucket with permissions for StreamNative

Choose a bucket location and grant access to StreamNative Cloud. You have two choices for setting up a storage bucket.

The Snowflake Open Catalog, storage bucket, and StreamNative BYOC Ursa cluster should be in the same cloud provider and region. Snowflake Open Catalog doesn’t support cross-region buckets. To avoid costs associated with cross-region traffic, we highly recommend that your storage bucket and StreamNative BYOC Ursa cluster are in the same region.

Option 1: Use your own bucket (recommended)

You need to create your own storage bucket, optionally with a bucket path. When using your own bucket, the resulting path you will use for creation of the Snowflake Open Catalog will be as follows. The compaction folder is created automatically by the StreamNative cluster.

AWS s3://<your-bucket-name>/<your-bucket-path>/compaction

GCP gs://<your-bucket-name>/<your-bucket-path>/compaction

StreamNative will require access to this storage bucket. To grant access, execute the following Terraform module based on your cloud provider.


AWS

  • external_id: your StreamNative organization name; directions for finding your StreamNative organization follow the Terraform modules
  • role: the name of the role that will be created in AWS IAM; the role ARN is needed when creating the cluster
  • buckets: the bucket name and path
  • account_ids: your AWS account ID
module "sn_managed_cloud" {
source = "github.com/streamnative/terraform-managed-cloud//modules/aws/volume-access?ref=v3.19.0"

external_id = "<your-organization-name>"
role = "<your-role-name>"
buckets = [
"<your-bucket-name>/<your-bucket-path>",
]

account_ids = [
"<your-aws-account-id>"
]
}

GCP

  • streamnative_org_id: your StreamNative organization ID; directions for finding your StreamNative organization follow the Terraform modules
  • project: the GCP project where the bucket is located
  • cluster_projects: the GCP project where the StreamNative BYOC Ursa cluster is located
  • google_service_account_id: the ID of the service account that will be created in GCP; the service account email is needed when creating the cluster
  • buckets: the bucket name and path
module "sn_managed_cloud_access_bucket" {
  source = "github.com/streamnative/terraform-managed-cloud//modules/gcp/volume-access?ref=v3.20.0"

  streamnative_org_id = "<your-organization-id>"

  project = "<your-project-name>"

  cluster_projects = [
    "<your-pulsar-cluster-gcp-project-name>"
  ]

  google_service_account_id = "<your-google-service-account-id>"

  buckets = [
    "<your-gcs-bucket-path>",
  ]
}

You can find your organization name in the StreamNative console, as shown below:

Click Account URL

Before executing the Terraform module, you must authenticate your terminal with your cloud provider. For AWS, the following environment variables grant access to the account where the S3 bucket is located.

Learn more about how to configure AWS CLI here.

AWS

export AWS_ACCESS_KEY_ID="<YOUR_AWS_ACCESS_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<YOUR_AWS_SECRET_ACCESS_KEY>"
export AWS_SESSION_TOKEN="<YOUR_AWS_SESSION_TOKEN>"

GCP

The following commands are used to authenticate using a user account (see https://cloud.google.com/docs/terraform/authentication):

gcloud init
gcloud auth application-default login

Run the Terraform module

terraform init
terraform plan
terraform apply
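After terraform apply completes, you can confirm the identity that was created and capture the value needed later during cluster creation. A minimal sketch, assuming the role and service account names you supplied to the modules above:

# AWS: print the ARN of the role the module created (needed when creating the cluster)
aws iam get-role --role-name <your-role-name> --query 'Role.Arn' --output text

# GCP: print the email of the service account the module created (needed when creating the cluster)
gcloud iam service-accounts list --project=<your-project-name> \
  --filter="email~<your-google-service-account-id>" --format="value(email)"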

Option 2: Use StreamNative provided bucket

This process involves deploying the StreamNative BYOC Cloud Connection and Cloud Environment, then beginning deployment of the StreamNative BYOC Ursa Cluster to obtain the cluster id. StreamNative will automatically assign the necessary permissions to this bucket.

To proceed, you will need to first complete the steps for granting vendor access, creating a Cloud Connection, and setting up the Cloud Environment. Next, begin the process of deploying the StreamNative BYOC Ursa Cluster to obtain the cluster id. Step 1 of Create StreamNative BYOC Ursa Cluster below includes directions on obtaining the cluster id from the Lakehouse Storage Configuration page.

When using a StreamNative-provided bucket, the resulting path you will use for creation of the Snowflake Open Catalog will be as follows. The cloud environment id will be created during the deployment of the Cloud Environment. The cluster id is assigned when starting the cluster creation process in the StreamNative Console and is found on the Lakehouse Storage Configuration page.

AWS s3://<your-cloud-environment-id>/<your-cluster-id>/compaction

GCP gs://<your-cloud-environment-id-tiered-storage>/<your-cluster-id>/compaction

Step 4: Configure Cloud Provider Account for Snowflake Open Catalog Access

AWS

Create an IAM policy and role for Snowflake Open Catalog access.

In the AWS console, enter Access management → Policies → Create policy

Click AWS Policy

Then choose the JSON format. Enter the policy as follows, replacing <your-bucket-name> and <your-bucket-path>:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<your-bucket-name>/<your-bucket-path>/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<your-bucket-name>/<your-bucket-path>",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "*"
                    ]
                }
            }
        }
    ]
}

Click Next

Click AWS Policy

Provide a policy name and click Create policy.

Click AWS Policy
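If you script your AWS setup instead of using the console, the same policy can be created with the AWS CLI. A minimal sketch, assuming the JSON above is saved as policy.json; the policy name is illustrative:

# Create the S3 access policy from the JSON document above
aws iam create-policy \
  --policy-name snowflake-open-catalog-access \
  --policy-document file://policy.json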

Create IAM Role

In the AWS console, enter Access management → Roles → Create role

Click AWS Role

  • Trusted entity type: AWS account
  • An AWS account: this account
  • Enable External ID and set External ID: training_test (this will be used when creating the catalog)

Click Next

Set AWS External Id

Select the policy created in the previous step. Then click Next

Set AWS External Id

Input a role name and click Create role.

Create role

View the detailed role information and record the ARN

Create ARN

This policy and role are used for Snowflake Open Catalog access to the S3 bucket.

GCP

Create a role for Snowflake Open Catalog bucket access.

Navigate to IAM → Roles → Create role

Create Role

Provide a role title and ID (e.g. streamnative_pulsar_open_catalog).

Create Title and Id

Provide the following permissions:

  • storage.buckets.get
  • storage.objects.create
  • storage.objects.delete
  • storage.objects.get
  • storage.objects.list

Click Create. This role will be used by Snowflake Open Catalog to access the bucket.

Assigned permissions in Google Cloud
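If you prefer scripting over the console, the same custom role can be created with gcloud. A minimal sketch; the project ID placeholder is an assumption, and the role ID matches the example above:

# Create a custom role with the five storage permissions listed above
gcloud iam roles create streamnative_pulsar_open_catalog \
  --project=<your-project-id> \
  --title="streamnative_pulsar_open_catalog" \
  --permissions=storage.buckets.get,storage.objects.create,storage.objects.delete,storage.objects.get,storage.objects.list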

Step 5: Create Snowflake Open Catalog

Create Snowflake Open Catalog

Create ARN

AWS

  • Name: supply a catalog name (example: streamnative)
  • External: disabled
  • Storage provider: S3
  • Default base location:
  • User provided bucket:
    • s3://<your-bucket-name>/<your-bucket-path>/compaction
  • StreamNative provided bucket:
    • s3://<your-cloud-environment-id>/<your-cluster-id>/compaction
  • Additional location: not configured
  • S3 role ARN: the ARN of the role created in the previous step
  • External ID: the external ID set in the previous step (e.g. training_test)

Create ARN

Then click Create; you will see the catalog streamnative created.

View the catalog details and capture the value of the IAM user ARN. Snowflake Open Catalog will use this ARN to access your AWS bucket.

Create ARN

Trust the Snowflake Open Catalog IAM user ARN

In the AWS console, enter Access management → Roles, search for the role we created before.

Create ARN

Then click Trust relationships → Edit trust policy.

Change the value of Principal:AWS to the Snowflake Open Catalog IAM user ARN.

Create ARN

Then click Update policy; the Snowflake Open Catalog can now access the bucket.
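For reference, the updated trust policy should look like the following sketch, where the principal is the IAM user ARN captured from the catalog details and the external ID is the one set earlier. The same change can also be applied from the CLI:

# Write the trust policy, then attach it to the role created earlier
cat > trust-policy.json <<'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "AWS": "<snowflake-open-catalog-iam-user-arn>" },
            "Action": "sts:AssumeRole",
            "Condition": { "StringEquals": { "sts:ExternalId": "training_test" } }
        }
    ]
}
EOF
aws iam update-assume-role-policy \
  --role-name <your-role-name> \
  --policy-document file://trust-policy.json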

GCP

  • Name: supply a catalog name (e.g. pulsar)
  • External: disabled
  • Storage provider: GCS
  • Default base location:
  • User provided bucket:
    • gs://<your-bucket-name>/<your-bucket-path>/compaction
  • StreamNative provided bucket:
    • gs://<your-cloud-environment-id>/<your-cluster-id>/compaction

  • Additional locations: not configured

Create catalog

Then click Create; you will see the catalog pulsar created.

Select the catalog, then Catalog Details. Record the value of GCP_SERVICE_ACCOUNT; Snowflake Open Catalog will use this account to access your storage bucket.

Create catalog details

Navigate to Cloud Storage → Buckets and select the root of the storage bucket.

  • User provided bucket:
    • <your-bucket-name>
  • StreamNative provided bucket (will include your cloud environment id and tiered-storage in the bucket name):
    • <your-cloud-environment-id-tiered-storage>

Select Permissions → Grant access

Create catalog details

New principals: paste the GCP_SERVICE_ACCOUNT from the catalog

Role: paste the name of the role created in the previous step (e.g. streamnative_pulsar_open_catalog)

Create catalog details

Snowflake Open Catalog now has access to the GCP storage bucket.
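The same grant can also be applied from the CLI. A minimal sketch, assuming the custom role ID created in Step 4 and your project ID:

# Grant the Snowflake Open Catalog service account the custom role on the bucket
gcloud storage buckets add-iam-policy-binding gs://<your-bucket-name> \
  --member="serviceAccount:<GCP_SERVICE_ACCOUNT>" \
  --role="projects/<your-project-id>/roles/streamnative_pulsar_open_catalog"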

Step 6: Provide StreamNative Access to Snowflake Open Catalog

Our engine needs a connection to access the Snowflake Open Catalog, so we need to create one. We will later reuse this connection for Snowflake to access Snowflake Open Catalog.

Create ARN

  • Name: streamnativeconnection
  • Query Engine: not configured
  • Create new principal role: enable
  • Principal Role Name: streamnativeprincipal

Create ARN

Then click Create, and you will see a pane. Record the Client ID and Client Secret for this connection as <CLIENT ID>:<SECRET>. Our engine needs it to access the Snowflake Open Catalog.

Create ARN

We now have a Service Connection called streamnativeconnection linked to the Principal Role streamnativeprincipal.

Create a Snowflake Catalog Role

Enter Catalogs → select catalog streamnative (or pulsar) → Roles → + Catalog Role

Name: streamnative_open_catalog_role

Privileges:

  • NAMESPACE_CREATE
  • NAMESPACE_LIST
  • TABLE_CREATE
  • TABLE_LIST
  • TABLE_READ_DATA
  • TABLE_WRITE_DATA
  • TABLE_READ_PROPERTIES
  • TABLE_WRITE_PROPERTIES
  • NAMESPACE_READ_PROPERTIES
  • NAMESPACE_WRITE_PROPERTIES

Click Create.

Create ARN

Then click Grant to Principal Role

Create ARN

  • Catalog role to grant: streamnative_open_catalog_role
  • Principal role to receive grant: streamnativeprincipal

Then click Grant

Create ARN

The catalog role streamnative_open_catalog_role now has the 10 required permissions on catalog streamnative (or pulsar).

We will reuse this connection when connecting Snowflake to Snowflake Open Catalog.
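Before creating the cluster, you can optionally sanity-check the connection credentials against the Iceberg REST endpoints that Open Catalog exposes. A minimal sketch (assumes jq is installed; the account URL format is the one recorded earlier, and the catalog name in the path is the one you created):

OPEN_CATALOG_URI="https://<your-account-locator>.snowflakecomputing.com/polaris/api/catalog"

# Exchange the connection's client credentials for a short-lived token
TOKEN=$(curl -s -X POST "${OPEN_CATALOG_URI}/v1/oauth/tokens" \
  -d "grant_type=client_credentials" \
  -d "client_id=<CLIENT ID>" \
  -d "client_secret=<SECRET>" \
  -d "scope=PRINCIPAL_ROLE:ALL" | jq -r '.access_token')

# List namespaces in catalog streamnative (or pulsar); an empty list is expected before the cluster writes data
curl -s -H "Authorization: Bearer ${TOKEN}" \
  "${OPEN_CATALOG_URI}/v1/streamnative/namespaces"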

Create StreamNative BYOC Ursa Cluster

To proceed, you will need to first complete the steps for granting vendor access, creating a Cloud Connection, and setting up the Cloud Environment. Then you can begin the process of deploying the StreamNative BYOC Ursa Cluster. You can also watch this video to learn more about deploying the StreamNative BYOC Ursa Cluster (AWS example).

Step 1: Create a StreamNative BYOC Ursa Cluster in StreamNative Cloud Console

In this section we create and set up a cluster in StreamNative Cloud. Log in to StreamNative Cloud and click ‘Create an instance and deploy cluster’.

Create new instance

Click on Deploy BYOC

Deploy BYOC

Enter Instance name, select your Cloud Connection, select URSA Engine and click on Cluster Location

Enter instance name

Enter Cluster Name, select your Cloud Environment, select Multi AZ and click on Lakehouse Storage Configuration

Enter cluster details

To configure the Storage Location, there are two options:

Option 1: Select Use Your Own Bucket to choose your own storage bucket by entering the following details

AWS

  • AWS role ARN (created with the Terraform module)
  • Region
  • Bucket name
  • Bucket path
  • Confirm that StreamNative has been granted the necessary permissions to access your S3 bucket. The required permissions were granted by running a Terraform module.

Enter Lakehouse Storage Configuration

GCP

  • GCP service account: use the complete email address, which can be found in GCP IAM (the service account was created with the Terraform module)
  • Region
  • Bucket name
  • Bucket path
  • Confirm that StreamNative has been granted the necessary permissions to access your GCP storage bucket. The required permissions were granted by running a Terraform module.

GCP Use Your Own Bucket

Option 2: Select Use Existing BYOC Bucket to choose the bucket created by StreamNative

Use Existing BYOC Bucket

The UI will present you with the SN Bucket Location in this format to be used when creating the Snowflake Open Catalog.

AWS s3://<your-cloud-environment-id>/<your-cluster-id>/compaction

e.g. s3://aws-usw2-test-rni68-tiered-storage-snc/o-naa2l-c-vo06zqe-ursa/compaction

GCP

gs://<your-cloud-environment-id-tiered-storage>/<your-cluster-id>/compaction

e.g. gs://gcp-usw2-test-hhein-tiered-storage/o-78m1b-c-9ahma2v-ursa/compaction

If you are using the StreamNative provided bucket, do not close the browser while creating the catalog; doing so will cause StreamNative to create a new cluster id. Once a catalog is created in Snowflake Open Catalog, the base location and additional locations cannot be changed, so if the cluster id changes you would need to create a new catalog.

To integrate with Snowflake Open Catalog, Enable Catalog Integration and select Snowflake Open Catalog.

  • Warehouse: the catalog created in Snowflake Open Catalog (example: streamnative or pulsar)
  • URI: the Account URL recorded when creating the Snowflake Open Catalog account, with '/polaris/api/catalog' appended (see the screenshot below)
  • Authentication Type: select OAuth2 and create a new secret in StreamNative using the Snowflake Open Catalog service connection credentials “<CLIENT ID>:<SECRET>”

Enable Snowflake Open Catalog

Clicking Cluster Size will test the connection to the storage bucket and the Snowflake Open Catalog.

Click Deploy

Click Continue to begin sizing your cluster.

For this example, we deploy using the smallest cluster size. Click Finish to start deploying the StreamNative BYOC Ursa Cluster into your Cloud Environment.

Cluster Sizing

When cluster deployment is complete, it will appear on the Organization Dashboard with a green circle.

View Organization Dashboard

The Lakehouse Storage configuration can be viewed by clicking on the Instance on the Organization Dashboard and selecting Configuration in the left pane.

View Lakehouse Storage Configuration

Step 2: Produce Kafka messages to topic

Follow the creating and running a producer section to produce Kafka messages to a topic.

Step 3: Review storage bucket

AWS

Navigate to the user provided or StreamNative provided S3 bucket. In this example the user provided bucket is s3://streamnativeopencatalog/test. A storage folder and a compaction folder have been created by the cluster.

Review S3 bucket

We published messages to multiple topics in the public/default tenant/namespace. We see folders for the tenant, namespace, and each topic inside the compaction folder.

View data in S3 bucket

Inside each topic folder, we find partition and metadata folders.

View partitions and metadata folders
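You can also inspect the layout from the CLI. A sketch using the example bucket above:

# Top-level folders created by the cluster
aws s3 ls s3://streamnativeopencatalog/test/
# Tenant/namespace/topic folders under compaction
aws s3 ls s3://streamnativeopencatalog/test/compaction/public/default/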

GCP

Navigate to the user provided or StreamNative provided GCP storage bucket. In this example the StreamNative provided bucket is gs://gcp-usw2-test-hhein-tiered-storage/o-78m1b-c-37kll59-ursa. A storage folder and compaction folder have been created by the cluster.

Storage bucket in Google Cloud

We published messages to topic kafkatopic1 in the public/default tenant/namespace. We see folders for the public tenant, default namespace, and kafkatopic1 topic inside the compaction folder. Inside each topic folder, we find partition and metadata folders.

Published message in Storage bucket in Google Cloud
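The same layout can be inspected from the CLI. A sketch using the example bucket above:

# Tenant/namespace/topic folders under compaction
gcloud storage ls gs://gcp-usw2-test-hhein-tiered-storage/o-78m1b-c-37kll59-ursa/compaction/public/default/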

Step 4: Verify Tables and Schema are Visible in Snowflake Open Catalog

Once we have published messages to a topic and the compaction folder has been created in the S3 bucket, we can verify the tables and schemas are visible in Snowflake Open Catalog. We can see the resulting topics created in streamnative/public/default with a registered schema.

Verify tables and schemas

Configure Snowflake to View Data from Snowflake Open Catalog

Querying a table in Snowflake Open Catalog using Snowflake requires completing the following steps from the Snowflake documentation. This video shows detailed queries for the above example (AWS example).

Step 1: Create an external volume in Snowflake

Please refer to the Snowflake documentation here for the complete code samples for creating an external volume for each cloud provider.

The video includes the following details from our AWS example:

  • When creating the new policy for Snowflake to access the S3 bucket, use the root of the S3 bucket to avoid a list error when verifying storage access.
  • When creating an external volume in Snowflake, use the complete bucket path, s3://<your-bucket-name>/<your-bucket-path>/compaction, for STORAGE_BASE_URL (see the sketch below).
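The shape of the statement for our AWS example is sketched below, run via the snowsql CLI. The volume name is illustrative, and the role ARN and external ID are those of the policy and role you create for Snowflake itself (not the ones created for StreamNative or Open Catalog); refer to the Snowflake documentation for the authoritative syntax:

# Create an external volume pointing at the compaction path
snowsql -q "
CREATE OR REPLACE EXTERNAL VOLUME streamnative_ext_vol
  STORAGE_LOCATIONS = ((
    NAME = 'streamnative-s3'
    STORAGE_PROVIDER = 'S3'
    STORAGE_BASE_URL = 's3://<your-bucket-name>/<your-bucket-path>/compaction'
    STORAGE_AWS_ROLE_ARN = '<snowflake-access-role-arn>'
    STORAGE_AWS_EXTERNAL_ID = '<external-id>'
  ));
"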

Step 2: Create a catalog integration for Open Catalog

Please refer to the Snowflake documentation here for the complete code samples.

The video includes the following details from our AWS example:

  • The CATALOG_NAMESPACE refers to the tenant.namespace in our StreamNative Cluster. Since we published messages to public.default, use public.default as the CATALOG_NAMESPACE.
  • We can reuse the <CLIENT ID>:<SECRET> for Snowflake Open Catalog to allow access for Snowflake. The <CLIENT ID> refers to OAUTH_CLIENT_ID and <SECRET> refers to OAUTH_CLIENT_SECRET (see the sketch after this list).

You will need to create a new catalog integration for each tenant.namespace.
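A sketch of the integration for our AWS example, run via snowsql; the integration name is illustrative, and the URI, warehouse, and credentials are the values recorded earlier:

# Create a catalog integration against the Open Catalog REST endpoint
snowsql -q "
CREATE OR REPLACE CATALOG INTEGRATION streamnative_catalog_int
  CATALOG_SOURCE = POLARIS
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'public.default'
  REST_CONFIG = (
    CATALOG_URI = 'https://<your-account-locator>.snowflakecomputing.com/polaris/api/catalog'
    WAREHOUSE = 'streamnative'
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<CLIENT ID>'
    OAUTH_CLIENT_SECRET = '<SECRET>'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
  )
  ENABLED = TRUE;
"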

Step 3: Create an externally managed table

Please refer to the Snowflake documentation here for the complete code samples.

The video includes the following details from our AWS example:

  • A Snowflake Open Catalog warehouse.schema.table (e.g. streamnative.public.default.kafkaschematopic) is mapped to a Snowflake database.schema.table (e.g. training.public.kafkaschematopic)
  • Use AUTO_REFRESH = TRUE; in CREATE ICEBERG TABLE to ensure new data is viewable in Snowflake (see the sketch below).

You will need to create a new externally managed table for each topic.
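A sketch of the table creation for the mapping described above, run via snowsql; the training database follows the example, and the volume and integration names follow the earlier sketches:

# Map the Open Catalog table into a Snowflake database and query it
snowsql -q "
CREATE DATABASE IF NOT EXISTS training;
CREATE ICEBERG TABLE training.public.kafkaschematopic
  EXTERNAL_VOLUME = 'streamnative_ext_vol'
  CATALOG = 'streamnative_catalog_int'
  CATALOG_TABLE_NAME = 'kafkaschematopic'
  AUTO_REFRESH = TRUE;
SELECT * FROM training.public.kafkaschematopic LIMIT 10;
"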

Once these steps are complete, you will be able to query the Iceberg table registered in Snowflake Open Catalog through Snowflake AI Data Cloud.
