Integrate With Databricks Unity Catalog
Introduction
This guide offers a detailed walkthrough for integrating StreamNative Cloud with Databricks Unity Catalog. It covers essential aspects such as configuring authentication, networking, storage buckets, catalogs, and other key components. Databricks Unity Catalog integration is available with StreamNative BYOC Ursa clusters, which can be deployed into your AWS or GCP cloud account; specific directions are included for each cloud provider where applicable. By following this guide, you will enable seamless interaction between StreamNative Cloud and Databricks Unity Catalog.
Setup Databricks
Before initiating the integration of Databricks with StreamNative Cloud, please ensure the following prerequisites are fulfilled. You can also watch this video to learn more about Preparing Databricks Account (AWS Example).
Cloud Service Provider Permissions:
Setting up a Databricks workspace requires appropriate AWS or GCP permissions. Ensure you are logged into your cloud provider account with an active session and administrative privileges to enable seamless authorization. To simplify the required permissions, we recommend using the same AWS or GCP account you used to create a StreamNative BYOC Cloud Environment.
Subscription Level:
In Databricks Cloud, upgrade to an Enterprise subscription. The Premium plan imposes restrictions on specific operations essential for this integration.
Step 1: Create Databricks Workspace
Click Create workspace to proceed
AWS
Choose Quickstart, and click Next
Enter your workspace name and choose the AWS region. In this example, the bucket is located in the us-east-2 region, so that region is selected. To avoid cross-regional data transfer costs, we recommend choosing the same region where the StreamNative Cloud Environment is created.
Finally, click Start Quickstart to proceed.
The process will redirect you to the AWS Management Console. Enable the acknowledgment checkbox and click Create Stack to continue.
The system will then initiate the setup tasks in AWS, which may take some time to complete.
After a few minutes, you will see the CREATE_COMPLETE event for the workspace, indicating the setup is successfully completed.
Next, return to the Databricks console. You will see that the workspace has been successfully created. Click Open to access the Unity Catalog console.
The Unity Catalog console will appear as illustrated below.
GCP
Enter your workspace name and choose the GCP region. In this example, our storage bucket is located in the us-east4 region, so that region is selected. To avoid cross-regional data transfer costs, we recommend choosing the same region where the StreamNative Cloud Environment is created. Enter your Google Cloud project ID and click Save.
You will see the workspace has been successfully created. Click Open to access the Unity Catalog console
The Unity Catalog console will appear as illustrated below.
Step 2: Configure network and external catalog access settings
Configure network settings
By default, the Databricks Unity Catalog server restricts access from external engines. To enable our engine to access the Unity Catalog server, we need to set enableIpAccessLists to true. To configure this network setting, you will need a Personal Access Token (PAT).
Generate a Personal Access Token in the Databricks console by following the steps below:
In your Databricks workspace, click your Databricks username in the top bar, and then select Settings from the drop-down.
Then navigate to Developer → Access Tokens → Manage.
Generate a new token
Copy the generated token.
To update this configuration, use your PAT token to execute a curl command with the following details:
curl -X PATCH -H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{"enableIpAccessLists": "true"}' \
<Unity_Catalog_URI>/api/2.0/workspace-conf
Replace <Unity_Catalog_URI> with your workspace URI and <token> with your access token.
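You can optionally verify that the setting took effect by reading it back (a quick check, not a required step):
curl -X GET -H "Authorization: Bearer <token>" \
  "<Unity_Catalog_URI>/api/2.0/workspace-conf?keys=enableIpAccessLists"
The response should contain {"enableIpAccessLists": "true"}.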
You can also use this PAT token to configure StreamNative authentication with Databricks Unity Catalog. This process will be discussed in greater detail in Step 3.
Configure external data access
Click Catalog → Settings → Metastore to proceed.
Enable the External Data Access option.
Step 3: Configure Unity Catalog access settings
Part A: Choose authentication method
There are two ways to authenticate and authorize a StreamNative cluster to access Databricks Unity Catalog.
Personal Access Token (PAT)
Databricks recommends using the PAT token for development and testing purposes only. Use the PAT created in Step 2 when you configure Catalog Integration while creating the StreamNative cluster.
OAuth2
Databricks recommends using OAuth2 for configuring authentication between StreamNative Cloud and Databricks Unity Catalog for production deployments. To configure OAuth2 machine-to-machine authentication, follow the steps below to generate a Client ID and Secret for a Service Principal.
Within the settings of your Databricks workspace, navigate to Identity and access, and click the Manage button next to Service principals.
Click on Add service principal to create a new principal or click on an existing service principal
Click the Secrets tab, then click Generate secret to generate a Client ID and Secret, which will be used later when configuring the StreamNative cluster.
Part B: Configure access permissions
Grant the necessary privileges on the catalog to ensure appropriate access permissions (a REST API alternative is sketched after this list).
Principal:
- For PAT, select All account users.
- For OAuth2, select the Service Principal. The service principal might not show up in the drop-down, and you may need to search for it.
Privilege Presets: Choose Data Editor, which will automatically select the relevant privileges.
External Use Schema: Ensure this option is Enabled.
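If you prefer to script these grants instead of using the UI, the Unity Catalog permissions REST API can apply the same privileges with your PAT. The privilege list below is indicative only (roughly what the Data Editor preset plus External Use Schema covers); adjust it to what your catalog actually requires, and replace the principal with the service principal's application ID (OAuth2) or the account users group (PAT):
curl -X PATCH -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"changes": [{"principal": "<service-principal-application-id-or-group>", "add": ["USE_CATALOG", "USE_SCHEMA", "SELECT", "MODIFY", "CREATE_TABLE", "EXTERNAL_USE_SCHEMA"]}]}' \
  "<Unity_Catalog_URI>/api/2.1/unity-catalog/permissions/catalog/<your-catalog-name>"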
Step 4: Setup storage bucket
Choose the bucket location and grant access to StreamNative Cloud. You have two choices for setting up a storage bucket.
Option 1: Use your own bucket
You need to create your own storage bucket, optionally with a bucket path.
StreamNative will require access to this storage bucket. To grant access, execute the following Terraform module based on your cloud provider.
AWS
- external_id: your StreamNative organization name (directions for finding it follow the Terraform modules)
- role: the name of the IAM role that will be created in AWS; its ARN is needed when creating the cluster
- buckets: the bucket name and path
- account_ids: the AWS account ID
module "sn_managed_cloud" {
source = "github.com/streamnative/terraform-managed-cloud//modules/aws/volume-access?ref=v3.19.0"
external_id = "<your-organization-name>"
role = "<your-role-name>"
buckets = [
"<your-bucket-name>/<your-bucket-path>",
]
account_ids = [
"<your-aws-account-id>"
]
}
GCP
- streamnative_org_id: your StreamNative organization ID (directions for finding it follow the Terraform modules)
- project: the GCP project where the bucket is located
- cluster_projects: the GCP project where the StreamNative BYOC Ursa cluster is located
- google_service_account_name: the name of the service account that will be created in GCP; its email address is needed when creating the cluster
- buckets: the bucket name and path
module "sn_managed_cloud_access_bucket" {
source = "[github.com/streamnative/terraform-managed-cloud//modules/gcp/volume-access](http://github.com/streamnative/terraform-managed-cloud//modules/gcp/volume-access)"
streamnative_org_id = "<your-organization-id>"
project = "<your-project-name>"
cluster_projects = [
"<your-pulsar-cluster-gcp-project-name>"
]
google_service_account_name = "<your-google-service-account-name>"
buckets = [
"<your-bucket-name>/<your-bucket-path>",
]
}
You can find the organization name in the StreamNative Console, as shown below:
Before executing the Terraform module, you must authenticate your local console with your cloud provider.
AWS
These variables are used to grant your console access to the AWS account where the S3 bucket is located. Learn more
export AWS_ACCESS_KEY_ID="<YOUR_AWS_ACCESS_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<YOUR_AWS_SECRET_ACCESS_KEY>"
export AWS_SESSION_TOKEN="<YOUR_AWS_SESSION_TOKEN>"
GCP
The following commands are used to authenticate using a user account. Learn more
`gcloud init`
`gcloud auth application-default login`
Run the Terraform module for your cloud provider:
terraform init
terraform plan
terraform apply
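For AWS, the module creates the IAM role you named above; its ARN is needed later when creating the cluster. Assuming the AWS CLI is configured for the same account, one way to look it up is:
aws iam get-role --role-name <your-role-name> --query 'Role.Arn' --output text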
For GCP, after applying Terraform you can find the email address of the google_service_account_name service account in GCP IAM. This will be needed when creating the StreamNative BYOC Ursa cluster.
Navigate to IAM → Service Accounts, and find the Email of the Service Account.
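Equivalently, assuming the gcloud CLI is authenticated against that project, you can look up the service account from the command line:
gcloud iam service-accounts list --project <your-project-name> --filter="email:<your-google-service-account-name>"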
Option 2: Use StreamNative provided bucket
This process involves deploying the StreamNative BYOC Cloud Environment. StreamNative will automatically assign the necessary permissions to this bucket. To proceed, you will need to complete the steps for granting vendor access, creating a Cloud Connection, and setting up the Cloud Environment.
Step 5: Grant bucket permissions to the Databricks Unity Catalog role
You can choose one of the following to create credentials according to your requirements.
Databricks workspace is not under the same AWS account as your S3 bucket
- Log in to the AWS console where the S3 bucket is located.
- Refer to this document to create a storage credential.
Databricks workspace is under the same AWS account as your S3 bucket
During the Databricks workspace initialization, an AWS role is automatically created for the Unity Catalog. You can view the Unity Catalog role's ARN (Amazon Resource Name) in the AWS console.
AWS
Click Catalog → Settings → Credentials to proceed.
Copy the role ARN as shown in the figure below. We will need this in the next steps.
To grant bucket permission to this role, follow these steps:
- Access AWS IAM Console: Log in to the AWS Management Console and navigate to the IAM service.
- Search for the Role: In the IAM dashboard, search for the IAM role.
- View Role Details: Click on the role to open its detail page.
Click on the attached policy for the role, then select Edit Policy to make the necessary modifications.
Once you view the policy content, you will notice that it already grants permissions to the bucket, which was created by Databricks. Since we are not using this bucket, we need to modify the policy to grant permissions to our bucket.
Edit the policy to include the ARN of your bucket. This can be the bucket created by StreamNative or the user created bucket from Step 4. Please note that when entering the bucket name, you should specify only the bucket name itself and exclude the path.
To view the name of the storage bucket created by StreamNative, log in to the AWS Cloud Console and navigate to the AWS S3 service. Search for a bucket with the following naming format: <YOUR_CLOUD_ENVIRONMENT_ID>-tiered-storage-snc.
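Alternatively, assuming the AWS CLI is configured against that account, you can list your buckets and filter by the suffix:
aws s3 ls | grep tiered-storage-snc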
Ensure the correct permissions are applied for access. For example:
Update the policy to include the bucket ARN for the root path and also with '/*' appended, as shown in the picture below. In the image below we are using a custom bucket called 'test-databrick-unity-catalog'.
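As an illustrative sketch only, the statement for the example bucket would end up with Resource entries like the following; keep whatever Action list the Databricks-generated policy already contains (the actions shown here are just representative S3 permissions):
{
  "Effect": "Allow",
  "Action": [
    "s3:GetBucketLocation",
    "s3:ListBucket",
    "s3:GetObject",
    "s3:PutObject",
    "s3:DeleteObject"
  ],
  "Resource": [
    "arn:aws:s3:::test-databrick-unity-catalog",
    "arn:aws:s3:::test-databrick-unity-catalog/*"
  ]
}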
After editing the policy, click Next to review your changes, and then click Save Changes to apply the updated permissions.
GCP
Databricks workspace is not under the same GCP project as your storage bucket
The terraform module for GCP providing StreamNative access to the user provided storage bucket requires you to specify the project name of both the storage bucket and the StreamNative BYOC Ursa cluster. The correct permissions will be applied to allow access across projects.
Databricks workspace is under the same GCP project as your storage bucket
During the Databricks workspace initialization, a GCP Service Account is automatically created for the Unity Catalog. Complete the following to obtain this service account and grant it the correct permissions.
Click Catalog → Settings → Credentials to proceed.
Copy the GCP Service Account Email as shown in the image below.
In GCP, navigate to IAM → Roles → Create role.
Input the role Title and ID. Select + Add permissions and add the following permissions (an equivalent gcloud command is sketched after this list).
- storage.buckets.get
- storage.objects.create
- storage.objects.delete
- storage.objects.get
- storage.objects.list
Then click Create.
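Equivalently, you can create the same custom role from the command line; the role ID and title below are placeholders:
gcloud iam roles create <your-role-id> \
  --project <your-project-name> \
  --title "<your-role-title>" \
  --permissions storage.buckets.get,storage.objects.create,storage.objects.delete,storage.objects.get,storage.objects.list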
Apply the role to the service account on the bucket.
Find the custom storage bucket or tiered storage bucket of your StreamNative Cloud Environment. In the example below we are using the tiered storage bucket of the StreamNative Cloud Environment. Click Permissions → View by principals → Grant access.
Paste the Databricks Service Account Email and choose the role we just created.
Click SAVE. The Databricks Service Account can now access the storage bucket.
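If you prefer the CLI, the same binding can be applied with gcloud; the bucket name, project, and role ID are placeholders matching the custom role created above:
gcloud storage buckets add-iam-policy-binding gs://<your-bucket-name> \
  --member "serviceAccount:<databricks-service-account-email>" \
  --role "projects/<your-project-name>/roles/<your-role-id>"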
Setup StreamNative Cluster
Before creating a cluster, make sure you complete the steps for granting vendor access, creating a Cloud Connection, and setting up the Cloud Environment. You can also watch this video to learn more about deploying a StreamNative Cluster.
Step 1 : Create an Ursa cluster in StreamNative Cloud Console
In this section we will create and set up a cluster in StreamNative Cloud. Log in to StreamNative Cloud and click 'Create an instance and deploy cluster'.
Click on Deploy BYOC
Enter Instance name, select Cloud Connection, select URSA Engine and click on Cluster Location
Enter Cluster Name, select Cloud Environment, select Multi AZ
To configure the Storage Location, there are two options:
Select Use Existing BYOC Bucket to choose the bucket created by StreamNative
Select Use Your Own Bucket to choose your own storage bucket by entering the following details
AWS
- AWS role
- Region
- Bucket name
- Bucket path
- Confirm that StreamNative has been granted the necessary permissions to access your S3 bucket. The required permissions were granted by running the Terraform module in Step 4.
GCP
- GCP service account: Use the complete email address, which can be found in GCP IAM (created by the Terraform module; this is not the GCP Service Account Email for the Databricks workspace)
- Region
- Bucket name
- Bucket path
- Confirm that StreamNative has been granted the necessary permissions to access your GCP storage bucket. The required permissions were granted by running a Terraform module.
Configure Catalog
- Enable Catalog Integration
- Within Lakehouse tables, select Managed Table
- Select Databricks Unity Catalog for Catalog Provider
- Enter Unity Catalog Details
- Enter catalog name
- Enter Schema name
- Enter URI
- Select Authentication Type : Personal Access Token (PAT) or OAuth2
There are two options for configuring authentication.
The Personal Access Token (PAT) is a suitable option for development and testing.
You can also verify that your token is working correctly in advance by following the steps below:
curl -X GET -H "Authorization: Bearer <your-personal-token>" \
-H "Content-Type: application/json" \
"https://dbc-<your-account>.cloud.databricks.com/api/2.1/unity-catalog/catalogs/<your-unity-catalog>"
The OAuth2 based authentication is recommended for production scenarios.
You can also verify that your OAuth2 credential is working correctly in advance by following the steps below:
export CLIENT_ID=<client-id>
export CLIENT_SECRET=<client-secret>
curl --request POST \
--url https://dbc-<your-account>.cloud.databricks.com/oidc/v1/token \
--user "$CLIENT_ID:$CLIENT_SECRET" \
--data 'grant_type=client_credentials&scope=all-apis'
This generates a response similar to:
{
"access_token": "eyJraWQiOiJkYTA4ZTVjZ…",
"token_type": "Bearer",
"expires_in": 3600
}
Copy the access_token from the response.
curl -X GET -H "Authorization: Bearer <access-token>" \
-H "Content-Type: application/json" \
"https://dbc-<your-account>.cloud.databricks.com/api/2.1/unity-catalog/catalogs/<your-unity-catalog>"
After entering all the details, deploy the cluster.
Once the cluster is successfully deployed, you can move to the next step and populate data in the cluster.
Step 2: Create an external location for the Databricks Unity Catalog
AWS
The Unity Catalog requires an external location to access the S3 bucket. To create an external location, follow these steps:
Click on Create External Location
We need to select the Manual option because the AWS Quickstart is not suitable for our scenario. Quickstart automatically creates a new bucket during the setup process, whereas in our case, the bucket has already been pre-created. Therefore, choosing the Manual option aligns with our requirements.
Enter the details listed below to create an external location (a REST API alternative is sketched after this list).
- External Location Name: Enter any name of your choice.
- URL: Specify the URL of the storage bucket.
- For a StreamNative Provided Bucket, the path has the following format. s3://<CLOUD_ENVIRONMENT_ID>/<CLUSTER_ID>/compaction
- For a User Provided Bucket, the path has the following format. s3://<CUSTOM_BUCKET>/<PATH>/compaction
- Storage Credential: Select the IAM role from the drop-down. You can fetch this role from the Databricks workspace you created in Step 1.
Click Create, and the external location will be successfully created.
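If you prefer to script this step, the Unity Catalog external locations REST API can create the same external location with your PAT; the name, URL, and storage credential name below are placeholders:
curl -X POST -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"name": "<external-location-name>", "url": "s3://<your-bucket-name>/<your-bucket-path>/compaction", "credential_name": "<your-storage-credential-name>"}' \
  "<Unity_Catalog_URI>/api/2.1/unity-catalog/external-locations"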
If you use OAuth2 to access the Unity Catalog, you should grant the permission to the service principal: navigate to your External Location → Permissions → Grant.
GCP
The Unity Catalog needs an external location to access the GCP bucket.
Click Catalog → Settings → External Locations
Click Create external location
Input the following
- External location name
- URL: GCP storage bucket root
- Storage credential: choose the Databricks Service Account
Click Create
Click Test connection to verify access
Grant the external location to the service principal (if using OAuth2) or All account users (if using PAT). Click Permissions → Grant, then grant ALL PRIVILEGES.
Step 3: Produce Kafka messages to topic
Follow the creating and running a producer section to produce Kafka messages to a topic.
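As a purely hypothetical sketch (the exact bootstrap endpoint, SASL settings, and credential format come from the producer guide linked above, not from this document), producing a few test messages with the Kafka console producer might look like this:
kafka-console-producer.sh \
  --bootstrap-server <your-cluster-kafka-endpoint>:9093 \
  --topic <your-topic> \
  --producer-property security.protocol=SASL_SSL \
  --producer-property sasl.mechanism=PLAIN \
  --producer-property 'sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<tenant>/<namespace>" password="token:<your-api-key>";'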
After the external location has been created and messages have been produced, navigate to the next step.
Review Ingested Data In Databricks
Step 1: Check the Databricks Unity Catalog console
In the Databricks Unity Catalog console, you will see that a table has already been created and is available for use.
[NOTE]: StreamNative Cloud adheres to the following conventions for converting special characters:
- / is replaced with __
- is replaced with ___
- . is replaced with ____
Step 2: Check the storage bucket
The messages from the topic will be automatically offloaded to the configured storage bucket as shown in the figure below.
Step 3: View ingested data in the Databricks Unity Catalog
At this point users can view the ingested data in the Unity Catalog as shown in the figure below.
Login to Databricks workspace and navigate to the catalog to view the ingested data in the tables.