Databricks

The Databricks connector is based on the Databricks SQL Driver for Node.js and is installed automatically with the Katalogue backend API service.

The connector currently supports only Databricks personal access token authentication. It requires the following information to connect to your Databricks instance:

  • Hostname, the Server Hostname value for your cluster or SQL warehouse.
  • Warehouse HTTP Path, the HTTP Path value for your cluster or SQL warehouse.
  • Access Token, the Databricks personal access token.
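As an illustration of how these three values map onto the underlying driver, here is a minimal sketch. The `ConnectorSettings` interface, the `buildConnectOptions` helper, and its validation rules are hypothetical and not part of Katalogue. The `host`, `path`, and `token` option names are those accepted by the `connect` method of the Databricks SQL Driver for Node.js.

```typescript
// Hypothetical shape of the three values entered in Katalogue.
interface ConnectorSettings {
  hostname: string;    // Server Hostname of the cluster or SQL warehouse
  httpPath: string;    // HTTP Path of the cluster or SQL warehouse
  accessToken: string; // Databricks personal access token
}

// Options object in the shape expected by DBSQLClient.connect().
interface ConnectOptions {
  host: string;
  path: string;
  token: string;
}

// Hypothetical helper: trims the inputs and rejects obviously malformed values.
// The checks are assumptions chosen for illustration, not Katalogue's rules.
function buildConnectOptions(settings: ConnectorSettings): ConnectOptions {
  const host = settings.hostname.trim();
  const path = settings.httpPath.trim();
  const token = settings.accessToken.trim();

  if (host === "" || path === "" || token === "") {
    throw new Error("Hostname, Warehouse HTTP Path and Access Token are all required");
  }
  if (host.indexOf("://") !== -1) {
    throw new Error("Hostname must not include a scheme such as https://");
  }
  if (path.charAt(0) !== "/") {
    throw new Error("Warehouse HTTP Path must start with '/'");
  }
  return { host, path, token };
}

// With the driver installed, the options could be used like this:
//
//   import { DBSQLClient } from "@databricks/sql";
//   const client = new DBSQLClient();
//   await client.connect(buildConnectOptions(settings));
//   const session = await client.openSession();
//   const operation = await session.executeStatement("SELECT 1");
//   const rows = await operation.fetchAll();
//   await operation.close();
//   await session.close();
//   await client.close();
```

The driver calls are left as comments so that the sketch runs without the `@databricks/sql` package installed.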

By default, Katalogue requires read access to the following resources:

  • system.information_schema.catalogs
  • system.information_schema.schemata
  • system.information_schema.tables
  • system.information_schema.views
  • system.information_schema.columns
  • system.information_schema.table_constraints
  • system.information_schema.referential_constraints
  • system.information_schema.key_column_usage
  • system.access.table_lineage

Note that other permissions might be required if custom import queries are used.
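One way to check that the connecting user can read each of the resources listed above is to run a small probe query against every one of them. The sketch below only generates the SQL statements. The `buildProbeQueries` helper is hypothetical and not part of Katalogue, and the queries would still need to be executed with the same access token that Katalogue uses.

```typescript
// Resources that Katalogue reads by default, as listed above.
const REQUIRED_RESOURCES: string[] = [
  "system.information_schema.catalogs",
  "system.information_schema.schemata",
  "system.information_schema.tables",
  "system.information_schema.views",
  "system.information_schema.columns",
  "system.information_schema.table_constraints",
  "system.information_schema.referential_constraints",
  "system.information_schema.key_column_usage",
  "system.access.table_lineage",
];

// Hypothetical helper: builds one cheap read probe per resource.
// A probe that fails with a permission error points at a missing grant.
function buildProbeQueries(resources: string[] = REQUIRED_RESOURCES): string[] {
  return resources.map(function (resource) {
    return "SELECT 1 FROM " + resource + " LIMIT 1";
  });
}
```

If custom import queries are used, the resources they read could be passed to `buildProbeQueries` as well.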

The Databricks connector does not sync source IDs for tables, views, and similar objects, because the Databricks information_schema does not expose them. As a result, the connector cannot handle a table rename optimally (that is, as an update). Instead, a renamed table is synced as a delete followed by an insert, so any references to the table are lost in the sync.
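To make the consequence concrete, the following sketch shows why a rename looks like a delete plus an insert when objects can only be matched by name. It is a simplified illustration under that assumption, not Katalogue's actual sync logic. The `diffByName` function and the table names are hypothetical.

```typescript
interface NameDiff {
  deleted: string[];  // names present in the previous sync but not the current one
  inserted: string[]; // names present in the current sync but not the previous one
}

// Hypothetical helper: compares two syncs using fully qualified names only.
// Without a stable source ID there is nothing that links an old name to a new one.
function diffByName(previous: string[], current: string[]): NameDiff {
  return {
    deleted: previous.filter(function (name) {
      return current.indexOf(name) === -1;
    }),
    inserted: current.filter(function (name) {
      return previous.indexOf(name) === -1;
    }),
  };
}

// Example: "sales.orders" was renamed to "sales.orders_v2" between two syncs.
const renameDiff = diffByName(
  ["sales.orders", "sales.customers"],
  ["sales.orders_v2", "sales.customers"]
);
// renameDiff.deleted contains "sales.orders" and renameDiff.inserted contains
// "sales.orders_v2"; nothing marks them as the same table.
```

With a stable source ID, the two entries could be matched and treated as one updated object, which is what would preserve the references.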

Dataset statistics such as row count and disk size are not synced, because this data is not available in the Databricks information_schema.

The SQL definition of a view is synced only if the user that Katalogue connects with owns the view, or belongs to a group that owns it. This is due to limitations in the Databricks permissions setup.