Databricks

Installation

The Databricks connector is based on the Databricks SQL Driver for Node.js and is installed automatically with the Katalogue backend API service.

Authentication

This connector supports the following authentication methods:

Personal Access Token (PAT)
OAuth M2M (Microsoft Entra ID managed service principal/app registration)
OAuth M2M (Databricks managed service principal)

The connector requires the following information to connect to your Databricks instance:

Hostname, the Server Hostname value for your cluster or SQL warehouse.
Warehouse HTTP Path, the HTTP Path value for your cluster or SQL warehouse.
Credentials, depending on authentication mode:
- Access Token, only when authenticating with Databricks personal access tokens (PATs).
- Client Id, only when authenticating with OAuth M2M.
- Client Secret, only when authenticating with OAuth M2M.
- Azure Tenant Id, only when authenticating with OAuth M2M (Microsoft Entra ID managed service principal) and when Databricks is hosted on Azure.
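
As a rough illustration of how these fields could map onto the connection options of the Databricks SQL Driver for Node.js, the sketch below builds an options object per authentication mode. The `ConnectorConfig` type and the `buildConnectOptions` function are hypothetical names and not part of Katalogue. The option names (`host`, `path`, `token`, `authType`, `oauthClientId`, `oauthClientSecret`, `azureTenantId`) are assumed to match the driver's connection options and should be verified against the driver version in use.

```typescript
// Hypothetical config shape mirroring the fields listed above.
type AuthMode = "pat" | "oauth-m2m";

interface ConnectorConfig {
  hostname: string;       // Server Hostname of the cluster or SQL warehouse
  httpPath: string;       // HTTP Path of the cluster or SQL warehouse
  authMode: AuthMode;
  accessToken?: string;   // PAT only
  clientId?: string;      // OAuth M2M only
  clientSecret?: string;  // OAuth M2M only
  azureTenantId?: string; // OAuth M2M on Azure, when applicable
}

// Builds an options object in the shape assumed for the driver's connect() call.
function buildConnectOptions(cfg: ConnectorConfig): Record<string, string> {
  const base = { host: cfg.hostname, path: cfg.httpPath };

  if (cfg.authMode === "pat") {
    if (!cfg.accessToken) {
      throw new Error("Access Token is required for PAT authentication");
    }
    return { ...base, token: cfg.accessToken };
  }

  if (!cfg.clientId || !cfg.clientSecret) {
    throw new Error("Client Id and Client Secret are required for OAuth M2M");
  }
  const options: Record<string, string> = {
    ...base,
    authType: "databricks-oauth",
    oauthClientId: cfg.clientId,
    oauthClientSecret: cfg.clientSecret,
  };
  // The tenant id is passed through only when it has been provided.
  if (cfg.azureTenantId) {
    options.azureTenantId = cfg.azureTenantId;
  }
  return options;
}
```

Validating the credentials per mode before connecting makes a missing field fail with a clear message rather than with an authentication error from the driver.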

Required Permissions

By default, Katalogue requires read access to the following tables in Databricks:

system.information_schema.catalogs
system.information_schema.schemata
system.information_schema.tables
system.information_schema.views
system.information_schema.columns
system.information_schema.table_constraints
system.information_schema.referential_constraints
system.information_schema.key_column_usage
system.access.table_lineage

Note that other permissions might be required if custom import queries are used.

The following snippet shows how to set sufficient permissions in Databricks:

-- Grant read access to the required tables.
-- Users/resources normally have read access to system.information_schema,
-- whilst SELECT on system.access.* must be explicitly granted.
GRANT USE CATALOG ON CATALOG system TO `<user/resource_id>`;
GRANT USE SCHEMA ON SCHEMA system.access TO `<user/resource_id>`;
GRANT SELECT ON TABLE system.access.table_lineage TO `<user/resource_id>`;

-- BROWSE on relevant catalogs is normally required
-- for data to show up in the information_schema tables.
-- For each catalog/database that is to be synced with Katalogue:
GRANT BROWSE ON CATALOG your_catalog TO `<user/resource_id>`;
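
Because the BROWSE grant has to be repeated for every catalog that is synced, the statements can be generated from a list of catalog names. The following is a minimal sketch only: `quoteIdent` and `buildBrowseGrants` are hypothetical helpers, and the principal and catalog names are placeholders.

```typescript
// Quote an identifier with backticks, doubling any embedded backtick.
function quoteIdent(name: string): string {
  return "`" + name.replace(/`/g, "``") + "`";
}

// Produce one GRANT BROWSE statement per catalog for the given principal.
function buildBrowseGrants(principal: string, catalogs: string[]): string[] {
  return catalogs.map(
    (catalog) =>
      `GRANT BROWSE ON CATALOG ${quoteIdent(catalog)} TO ${quoteIdent(principal)};`
  );
}

// Placeholder principal and catalog names.
const statements = buildBrowseGrants("katalogue-sp", ["sales", "finance"]);
console.log(statements.join("\n"));
```

The generated statements can then be reviewed and run in a Databricks SQL editor by a user with sufficient privileges on the catalogs.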

Limitations

Source Ids

The Databricks connector does not sync source ids for tables, views and other objects, because the Databricks information_schema does not expose them. As a result, the connector cannot handle a table rename as an update. Instead, a renamed table is synced as a delete followed by an insert, which means that any references to the table are lost in the sync.
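
To illustrate why a rename turns into a delete/insert when the name is the only available key, the sketch below diffs two snapshots of table names. It is illustrative only and not Katalogue's actual sync logic; `diffByName`, `SyncPlan` and the table names are hypothetical.

```typescript
// Illustrative only: without a stable source id, objects are matched by name.
interface SyncPlan {
  inserts: string[];
  deletes: string[];
  unchanged: string[];
}

function diffByName(previous: string[], current: string[]): SyncPlan {
  const prev = new Set(previous);
  const curr = new Set(current);
  return {
    inserts: current.filter((name) => !prev.has(name)),
    deletes: previous.filter((name) => !curr.has(name)),
    unchanged: current.filter((name) => prev.has(name)),
  };
}

// "sales.orders" was renamed to "sales.orders_v2" between two syncs.
const plan = diffByName(
  ["sales.orders", "sales.customers"],
  ["sales.orders_v2", "sales.customers"]
);
console.log(plan);
```

The renamed table appears once under `deletes` (old name) and once under `inserts` (new name), since nothing links the two names to the same object.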

Dataset Statistics

Dataset statistics such as row count and disk size are not synced, as this data is not available in the Databricks information_schema.

View SQL Definition

The SQL definition of a view is only synced if the user that Katalogue connects with is the owner of the view (or a member of a group that owns it). This is due to limitations in the Databricks permissions setup.