Databricks Unity Catalog
Description
Databricks Unity Catalog provides a unified governance solution for managing and securing data assets across cloud environments. With Onehouse's OneTable support, users can ingest and store data in Delta Lake format, and these tables automatically sync to Unity Catalog. This ensures seamless management of metadata, permissions, and lineage in a centralized, secure catalog.
Setup guide
- Create a workspace (if not already created) in Databricks.
- Refer to the Databricks docs for workspace creation.
- You will also need to enable Unity Catalog in the workspace. The Databricks documentation mentions that the use of a metastore is optional; for Onehouse, the metastore configuration is a requirement for Unity Catalog sync to work. See the Databricks documentation for Managing Unity Catalog along with Create a Unity Catalog for details on how to set up Unity Catalog and the Databricks metastore for your workspace.
Note: Databricks requires Delta Lake table metadata to work with Onehouse tables. To meet this requirement, make sure you have created a OneTable catalog with Delta Lake enabled and have attached it to the relevant Stream Capture.
- To sync your Onehouse tables with Databricks Unity Catalog, you must grant Databricks access to the storage location where your Onehouse tables are stored. This involves creating a Databricks Storage Credential and External Location.
- Create a Databricks Storage Credential.
- Refer to the Databricks documentation: Create a storage credential for connecting to AWS S3.
- You can reuse the AWS Role created earlier for Unity Catalog access, and simply attach the new policy for Onehouse S3 customer bucket access.
- Create a Databricks External Location.
- Since you are referencing an existing Onehouse S3 bucket, refer to Creating an external location manually using Catalog Explorer.
- Create the Databricks Unity Catalog metadata catalog in Onehouse.
- Attach the Onehouse Databricks catalog to one or more Stream Captures.
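The external location step above can also be scripted. A minimal sketch that builds a `CREATE EXTERNAL LOCATION` statement (standard Databricks SQL) for an Onehouse S3 bucket; the location, bucket, and credential names below are hypothetical placeholders, not values from this guide:

```python
# Sketch: build the SQL that registers an Onehouse S3 bucket as a
# Databricks external location backed by a storage credential.
# All names here are placeholders; substitute your own.

def external_location_sql(location_name: str, s3_url: str, credential_name: str) -> str:
    """Return a CREATE EXTERNAL LOCATION statement for Unity Catalog."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS `{location_name}` "
        f"URL '{s3_url}' "
        f"WITH (STORAGE CREDENTIAL `{credential_name}`)"
    )

sql = external_location_sql(
    "onehouse_data",                 # hypothetical external location name
    "s3://my-onehouse-bucket/data",  # hypothetical Onehouse S3 bucket path
    "onehouse_s3_credential",        # storage credential created in the prior step
)
print(sql)
```

You could run this statement from a SQL editor in the Databricks workspace instead of using Catalog Explorer; both paths produce the same external location.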
Parameters to be passed while creating a Databricks Unity Catalog in Onehouse
- Enter a Name to identify the data catalog in Onehouse
- Enter the catalog name in Databricks Unity Catalog where the tables will be synced. If the catalog doesn't already exist, it will be created.
- Enter the Databricks compute resource's `Server Hostname` and `HTTP Path` values. You can find these values in the Databricks explorer by navigating to `SQL -> SQL Warehouses -> (choose your warehouse) -> Connection details`.
- Select the Auth Type.
- For the Access Token auth type, enter the `personal-access-token` for your workspace user. Refer to the Databricks docs for personal access token generation.
- For the OAuth auth type, enter the service principal credentials `client-id` and `client-secret` from your workspace. Refer to the Databricks docs for service principal creation and for retrieving its credentials.
Example scenario
A Onehouse table with Table Name=`orders_by_product` in a Onehouse database with Database Name=`orders` will be synced to the Databricks Unity Catalog under the given input catalog as Schema=`orders` and Table=`orders_by_product`. The schema named `orders` will be created in the Databricks Unity Catalog if it doesn't already exist.
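The naming convention in this scenario can be summarized as catalog.schema.table, where the Onehouse database becomes the schema. A small sketch (the catalog name `my_catalog` is a hypothetical input value):

```python
# Sketch of the sync naming convention: Onehouse database -> Unity Catalog
# schema, Onehouse table -> Unity Catalog table, inside the input catalog.

def unity_catalog_target(catalog: str, onehouse_database: str, onehouse_table: str) -> str:
    """Return the fully qualified Unity Catalog name a Onehouse table syncs to."""
    return f"{catalog}.{onehouse_database}.{onehouse_table}"

print(unity_catalog_target("my_catalog", "orders", "orders_by_product"))
# -> my_catalog.orders.orders_by_product
```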
Role permissions
The following permissions are required for the user account used for the connection. Refer to the Databricks docs for managing privileges in Unity Catalog.
GRANT CREATE CATALOG ON METASTORE TO `user_name`;
GRANT ALL PRIVILEGES ON CATALOG `catalog-name` TO `user_name`; -- catalog-name is the input catalog name in Databricks Unity Catalog (if not already created, it will be created)
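If you manage several sync users or catalogs, the grants above can be generated rather than written by hand. A small sketch; the user and catalog names are hypothetical placeholders:

```python
# Sketch: generate the two GRANT statements listed above for a given
# connection user and input catalog name (placeholder values below).

def grant_statements(user_name: str, catalog_name: str) -> list[str]:
    """Return the Unity Catalog grants the Onehouse connection user needs."""
    return [
        f"GRANT CREATE CATALOG ON METASTORE TO `{user_name}`;",
        f"GRANT ALL PRIVILEGES ON CATALOG `{catalog_name}` TO `{user_name}`;",
    ]

for stmt in grant_statements("onehouse_sync_user", "my_catalog"):
    print(stmt)
```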