Datasource Syncing

Even though you can add datasets, fields, etc. manually in Katalogue, it is highly recommended to use the datasource sync functionality to automatically extract and sync metadata from your datasources. To set this up, you must first create a connection in Katalogue and configure it with connection credentials (see supported connectors in the Connectors section). Then, you must create a datasource sync task for the datasource.

This feature is supported for the following Connection Types:

  • Databricks
  • IBM DB2
  • Microsoft SQL Server
  • Oracle
  • PostgreSQL
  • Snowflake

By default, Katalogue uses pre-defined SQL queries for each connector when connecting to a relational database. It is possible to override these default queries with user-defined custom queries. Some use cases for this are:

  • Provide custom filters instead of using Katalogue’s datasource filters feature described below.
  • Use custom logic to retrieve e.g. primary key definitions. This is useful in scenarios where actual primary key constraints are missing but the primary keys can be derived from unique indexes, naming conventions, etc.
  • You do not want to, or cannot, connect to the actual source system directly, and therefore use the datasource as an intermediary.

Katalogue uses three import queries to get all the metadata it needs:

  1. Base Import Query: imports columns, datasets, schemas and related metadata such as column datatypes, primary keys, etc.
  2. Relationship Import Query: imports relationships (PK/FK constraints) between tables.
  3. Lineage Import Query: imports data lineage between tables.

Define a custom query like so:

  1. Create a Connection and select a Connection Type that supports custom import queries.
  2. Go to the “ADVANCED” tab and enable “Use Custom Base Import Query”.
  3. Click the “Edit Base Import Query” button to open a dialog where you can enter a custom SQL SELECT statement. See the requirements below.
  4. The default import query for the Connection Type can be retrieved by clicking the “Insert Default Query” button in the dialog.
  5. Click “Save” to save the query and close the dialog.
  6. Optionally, repeat the steps above for the “Use Custom Relationship Import Query” and “Use Custom Lineage Import Query” options to provide custom SQL SELECT statements for PK/FK constraints and dataset lineage.

Note that SELECT statements are the only supported SQL statements. Katalogue validates and sanitizes the query before execution to prevent SQL injection.

The custom base import query must return columns with the exact names (case-sensitive) listed below. The mandatory columns must be present and cannot be null or an empty string; they are the minimum data Katalogue needs to correctly import data.

The columns suffixed with “_source_id” are strongly recommended. Including them enables the datasource sync to handle renames: assets are primarily matched by name and, if no match is found, by source id. If there is no source id, Katalogue performs a delete and insert instead, which means that associations with Field Descriptions, custom descriptions, etc. are lost.

| Column Name | Datatype | Description | Is Mandatory |
| --- | --- | --- | --- |
| datasource_name | string | Database name | Yes |
| dataset_group_name | string | Schema name | Yes |
| dataset_name | string | Name of the table/view | Yes |
| dataset_type_name | string | Type of dataset. Must be either “table” or “view”. | Yes |
| field_name | string | Column name | Yes |
| datatype_name | string | Datatype name. Should be the raw name without appended precision and scale. | Yes |
| datatype_length | integer | Character length | |
| datatype_precision | integer | Numeric precision | |
| datatype_scale | integer | Numeric scale | |
| datasource_type_name | string | Database type. If present, this overrides the datasource type derived from the connector, i.e. it is possible to set this to “Microsoft SQL Server” even though the connector is to an Oracle database. The datasource will then show up as a “Microsoft SQL Server” datasource in the GUI instead of an Oracle datasource. | |
| datasource_source_id | string | Unique, internal id of the database in the source database. If present, Katalogue uses this to handle renames better by doing an update instead of a delete & insert. | |
| dataset_group_source_id | string | Unique, internal id of the schema in the source database. If present, Katalogue uses this to handle renames better by doing an update instead of a delete & insert. | |
| dataset_source_id | string | Unique, internal id of the table/view in the source database. If present, Katalogue uses this to handle renames better by doing an update instead of a delete & insert. | |
| field_source_id | string | Unique, internal id of the column in the source database. If present, Katalogue uses this to handle renames better by doing an update instead of a delete & insert. | |
| dataset_row_count | bigint | Number of rows in the dataset. Normally taken from a statistics table (much faster than a “select count(*)“). | |
| dataset_size_bytes | bigint | Total disk size in bytes of the dataset. | |
| is_primary_key | boolean | True if the column is part of a primary key constraint | |
| is_nullable | boolean | True if the column can be NULL | |
| ordinal_position | integer | Ordinal position of the column | |
| default_value | string | Default value of the column | |
| dataset_url | string | URL of the dataset. This will show up as a clickable link in the GUI. | |
| datasource_source_created | timestamp | Timestamp for when the database was created | |
| dataset_group_source_created | timestamp | Timestamp for when the schema was created | |
| dataset_source_created | timestamp | Timestamp for when the table/view was created | |
| datasource_source_modified | timestamp | Timestamp for when the database was last altered | |
| dataset_group_source_modified | timestamp | Timestamp for when the schema was last altered | |
| dataset_source_modified | timestamp | Timestamp for when the table/view was last altered | |
| dataset_technical_definition | string | Normally the SQL statement for views, but can be any transformation logic used to populate a table or view. | |
| dataset_technical_definition_raw | string | Can be used if the SQL statement/transformation logic is dynamically created, e.g. for dbt datasources: this column then contains the raw, uncompiled dbt code, while “dataset_technical_definition” contains the compiled SQL. | |
| datasource_description | string | Comment or description of the database | |
| dataset_group_description | string | Comment or description of the schema | |
| dataset_description | string | Comment or description of the table/view | |
| field_description | string | Comment or description of the column | |

Example query:

SELECT
"datasource_name"
, "dataset_group_name"
, "dataset_name"
, "dataset_type_name"
, "field_name"
, "datatype_name"
, "datatype_length"
, "datatype_precision"
, "datatype_scale"
, "datasource_type_name"
, "datasource_source_id"
, "dataset_group_source_id"
, "dataset_source_id"
, "field_source_id"
, "dataset_row_count"
, "dataset_size_bytes"
, "is_primary_key"
, "is_nullable"
, "ordinal_position"
, "default_value"
, "dataset_url"
, "datasource_source_created"
, "dataset_group_source_created"
, "dataset_source_created"
, "datasource_source_modified"
, "dataset_group_source_modified"
, "dataset_source_modified"
, "dataset_technical_definition"
, "dataset_technical_definition_raw"
, "datasource_description"
, "dataset_group_description"
, "dataset_description"
, "field_description"
FROM (
-- information schema, system tables etc.
)
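
For illustration, a minimal custom base import query for a PostgreSQL datasource could be built on the standard information_schema views as sketched below. This is a hand-written example, not Katalogue’s actual default query; it returns the six mandatory columns plus a few optional ones and leaves out e.g. primary key detection.

SELECT
c.table_catalog AS "datasource_name"
, c.table_schema AS "dataset_group_name"
, c.table_name AS "dataset_name"
-- map the information_schema table type onto the two values Katalogue accepts
, CASE WHEN t.table_type = 'VIEW' THEN 'view' ELSE 'table' END AS "dataset_type_name"
, c.column_name AS "field_name"
, c.data_type AS "datatype_name"
, c.character_maximum_length AS "datatype_length"
, c.numeric_precision AS "datatype_precision"
, c.numeric_scale AS "datatype_scale"
, (c.is_nullable = 'YES') AS "is_nullable"
, c.ordinal_position AS "ordinal_position"
, c.column_default AS "default_value"
FROM information_schema.columns c
JOIN information_schema.tables t
ON t.table_schema = c.table_schema
AND t.table_name = c.table_name
-- leave out the system schemas
WHERE c.table_schema NOT IN ('pg_catalog', 'information_schema')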

The custom relationship import query must return columns with these exact names (case-sensitive). The mandatory columns must be present and cannot be null or an empty string; they are the minimum data Katalogue needs to correctly import data.

| Column Name | Datatype | Description | Is Mandatory |
| --- | --- | --- | --- |
| relationship_name | string | Name of the FK constraint | Yes |
| relationship_ordinal_position | integer | Ordinal position of the column in the FK constraint. Should be 1 even if there is only one column in the constraint. | Yes |
| from_datasource_name | string | Name of the database with the FK constraint | Yes |
| from_dataset_group_name | string | Name of the schema with the FK constraint | Yes |
| from_dataset_name | string | Name of the table with the FK constraint | Yes |
| from_field_name | string | Name of the column in the FK constraint | Yes |
| to_datasource_name | string | Name of the database with the target/referenced column of the FK constraint | Yes |
| to_dataset_group_name | string | Name of the schema with the target/referenced column of the FK constraint | Yes |
| to_dataset_name | string | Name of the table with the target/referenced column of the FK constraint | Yes |
| to_field_name | string | Name of the target/referenced column of the FK constraint | Yes |
| relationship_source_created | timestamp | Timestamp for when the FK constraint was created | |

Example query:

SELECT
"relationship_name"
, "relationship_ordinal_position"
, "from_datasource_name"
, "from_dataset_group_name"
, "from_dataset_name"
, "from_field_name"
, "to_datasource_name"
, "to_dataset_group_name"
, "to_dataset_name"
, "to_field_name"
, "relationship_source_created"
FROM (
-- information schema, system tables etc.
)
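
For illustration, a custom relationship import query for a PostgreSQL datasource could be sketched from the standard information_schema views as below. This is not Katalogue’s actual default query. Note that information_schema.constraint_column_usage does not reliably pair columns for multi-column foreign keys, so a production query would typically join the pg_catalog tables instead; relationship_source_created is omitted since information_schema does not expose it.

SELECT
tc.constraint_name AS "relationship_name"
, kcu.ordinal_position AS "relationship_ordinal_position"
-- the "from" side: the table that carries the FK constraint
, kcu.table_catalog AS "from_datasource_name"
, kcu.table_schema AS "from_dataset_group_name"
, kcu.table_name AS "from_dataset_name"
, kcu.column_name AS "from_field_name"
-- the "to" side: the referenced table/column
, ccu.table_catalog AS "to_datasource_name"
, ccu.table_schema AS "to_dataset_group_name"
, ccu.table_name AS "to_dataset_name"
, ccu.column_name AS "to_field_name"
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON kcu.constraint_schema = tc.constraint_schema
AND kcu.constraint_name = tc.constraint_name
JOIN information_schema.constraint_column_usage ccu
ON ccu.constraint_schema = tc.constraint_schema
AND ccu.constraint_name = tc.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY'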

The custom lineage import query must return columns with these exact names (case-sensitive). The mandatory columns must be present and cannot be null or an empty string; they are the minimum data Katalogue needs to correctly import data.

| Column Name | Datatype | Description | Is Mandatory |
| --- | --- | --- | --- |
| lineage_name | string | Name of the procedure/job/transformation/… that moves the data | |
| lineage_type | string | Type that describes what kind of procedure/job/transformation/… moves the data | |
| from_datasource_name | string | Name of the upstream database, i.e. the source of the procedure/job/transformation/… | Yes |
| from_dataset_group_name | string | Name of the upstream schema, i.e. the source of the procedure/job/transformation/… | Yes |
| from_dataset_name | string | Name of the upstream table, i.e. the source of the procedure/job/transformation/… | Yes |
| to_datasource_name | string | Name of the downstream database, i.e. the target of the procedure/job/transformation/… | Yes |
| to_dataset_group_name | string | Name of the downstream schema, i.e. the target of the procedure/job/transformation/… | Yes |
| to_dataset_name | string | Name of the downstream table, i.e. the target of the procedure/job/transformation/… | Yes |

Example query:

SELECT
"lineage_name"
, "lineage_type"
, "from_datasource_name"
, "from_dataset_group_name"
, "from_dataset_name"
, "to_datasource_name"
, "to_dataset_group_name"
, "to_dataset_name"
FROM (
-- information schema, system tables etc.
)
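
Lineage is rarely available in standard system views, so a custom lineage query typically reads from whatever metadata your ETL tooling maintains. A hedged sketch is shown below, where etl.job_table_mapping is a hypothetical table invented for this example:

SELECT
-- hypothetical ETL metadata table; substitute your own tooling's tables
j.job_name AS "lineage_name"
, 'etl-job' AS "lineage_type"
, j.source_database AS "from_datasource_name"
, j.source_schema AS "from_dataset_group_name"
, j.source_table AS "from_dataset_name"
, j.target_database AS "to_datasource_name"
, j.target_schema AS "to_dataset_group_name"
, j.target_table AS "to_dataset_name"
FROM etl.job_table_mapping j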

Datasource filters are a feature to limit the data extracted from the source system. If no filters are provided, all tables, fields, etc. found in the INFORMATION_SCHEMA (or similar views) are imported, with the exception of some common tables such as the INFORMATION_SCHEMA views themselves.

Datasource filters are applied in the extract query from the datasource. Hence, data that is filtered out with the datasource filter never leaves the datasource.

The datasource filters are expressed as a JSON object with the following pattern:

{
"datasource_filter": [
{
"concatenator": "and",
"filters": [
{
"field": "dataset_name",
"operator": "not_equal",
"value": "field"
},
{
"field": "dataset_group_name",
"operator": "in",
"value": ["public", "stage"]
}
]
}
]
}

For a PostgreSQL datasource, the example above translates to the following SQL:

...
WHERE ( c.table_name <> 'field'
AND c.table_schema = ANY('{"public","stage"}') )

Valid concatenators:

  • and
  • or

All filters in the “filters” array are enclosed in parentheses and concatenated with the selected concatenator. It is possible to nest concatenators like so:

{
"datasource_filter": [
{
"concatenator": "and",
"filters": [
{
"field": "dataset_name",
"operator": "not_equal",
"value": "field"
},
{
"concatenator": "or",
"filters": [
{
"field": "dataset_group_name",
"operator": "equal",
"value": "public"
},
{
"field": "dataset_group_name",
"operator": "equal",
"value": "stage"
}
]
}
]
}
]
}

For a Postgres datasource, the example above results in the following SQL:

...
WHERE ( c.table_name <> 'field'
AND ( c.table_schema = 'public'
OR c.table_schema = 'stage' ) )

Valid fields to filter on:

  • datasource_name
  • dataset_group_name
  • dataset_name
  • dataset_type_name
  • field_name
  • datatype_name
  • datasource_source_id
  • dataset_group_source_id
  • dataset_source_id
  • field_source_id
  • dataset_technical_definition

Valid operators and the corresponding SQL map:

  • equal: “=”
  • not_equal: “<>”
  • greater: “>”
  • less: “<”
  • greater_equal: “>=”
  • less_equal: “<=”
  • “=”: “=”
  • “<>”: “<>”
  • “>”: “>”
  • “<”: “<”
  • “>=”: “>=”
  • “<=”: “<=”
  • in: “IN”
  • not_in: “NOT IN”
  • is: “IS”
  • is_not: “IS NOT”
  • like: “LIKE”
  • not_like: “NOT LIKE”
  • ilike: “ILIKE”
  • not_ilike: “NOT ILIKE”

ILIKE and NOT ILIKE are translated to valid SQL in databases that do not support these operators natively, e.g. Oracle and SQL Server. LIKE, NOT LIKE, ILIKE and NOT ILIKE also work with an array as value for Postgres and Snowflake datasources; this is currently not supported for any other datasources. Wildcard characters for any of the LIKE operators must be included in the “value” part of the filter clause like so:

{
"datasource_filter": [
{
"concatenator": "and",
"filters": [
{
"field": "dataset_name",
"operator": "ilike",
"value": ["field", "dataset%"]
}
]
}
]
}

For a Snowflake datasource, the example above results in the following SQL:

...
WHERE ( c.TABLE_NAME ILIKE ANY ('field', 'dataset%') )
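
PostgreSQL supports the same construct natively via ILIKE ANY over an array. As a hedged illustration (not verified Katalogue output), the translated SQL for a Postgres datasource could look like:

...
-- ILIKE ANY matches if the column matches at least one pattern in the array
WHERE ( c.table_name ILIKE ANY (ARRAY['field', 'dataset%']) )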

Datasource filters with custom import query

Datasource filters are supported in the same way as stated above when using your own custom import query.

Katalogue automatically imports primary key/foreign key constraints when they are present in the datasource and presents them visually. With the relationship mapping feature, Katalogue can also create such relationships between datasets that do not have PK/FK constraints in the datasource. This feature can therefore be used to show relationships where such constraints are not supported, such as between database views, or between datasets that simply lack them. Showing such relationships can be very helpful for analysts and others trying to understand e.g. a star schema modelled with views.

Relationship map rules are applied in the import query from the Katalogue stage.raw_relationship table to the public.relationship table. Hence, Katalogue always imports all relationships from the datasource to stage and then applies the map rules when moving data from the stage table. If you do not want some relationships to leave the datasource, you must use your own custom relationship import query to filter them out.

Katalogue categorizes all relationships according to how they were created. This is a simple label and is not tied to any functionality. The relationship type can be viewed by going to a Field -> Relationship tab.

There are currently three types of relationships:

  • physical: pk/fk constraint imported from the datasource
  • map-rule: relationship created with the relationship map rule feature
  • virtual: relationship created by inheritance through a Field Description. If a Field without a relationship is associated with a Field Description that in turn is associated with another Field with a relationship, Katalogue will in some cases automatically create (i.e. “guess”) a relationship for the first Field.

The feature to let Katalogue automatically create virtual relationships can be disabled for the entire datasource by checking the “Disable Virtual Relationships” checkbox in the datasource sync settings.

The relationship map rules are expressed as a JSON object with the following pattern:

{
"relationshipMap": [
{
"concatenator": "and",
"mapRules": [
{
"left": {
"regexp": "(.*__)?(.*)_DWKEY",
"captureGroup": 2,
"field": "from_field_name"
},
"operator": "equal",
"right": {
"regexp": "D_(.*)",
"field": "to_dataset_name"
}
},
{
"left": {
"regexp": "(.*__)?(.*)",
"captureGroup": 2,
"field": "from_field_name"
},
"operator": "equal",
"right": {
"field": "to_field_name"
}
},
{
"left": {
"field": "from_dataset_name"
},
"operator": "not_equal",
"right": {
"field": "to_dataset_name"
}
},
{
"left": {
"field": "from_dataset_group_name"
},
"operator": "equal",
"right": {
"field": "to_dataset_group_name"
}
}
]
}
]
}

The example above contains four map rules with the following meaning:

  1. Look at all field names that may contain a prefix ending with __ and that have the suffix _DWKEY, and match them with dataset names that have the prefix D_. Example: fields named CUSTOMER_DWKEY or, for example, PREVIOUS__CUSTOMER_DWKEY in any table will match all fields in dataset D_CUSTOMER.
  2. Only match field names with the same name. Example: field CUSTOMER_DWKEY in any table will only match field CUSTOMER_DWKEY.
  3. Do not match fields in the same dataset. This prevents self-join relationships. Example: field CUSTOMER_DWKEY in table D_CUSTOMER will not relate to itself.
  4. Only match fields and datasets within the same dataset group. Example: create relationships within schema MART_CUSTOMER and within schema MART_FINANCE, but no relationships between them.

As these four rules are combined with “and” logic, the result is: “Create a relationship from field CUSTOMER_DWKEY in any table to field CUSTOMER_DWKEY in table D_CUSTOMER, as long as the tables are in the same schema.”

The example above results in the following SQL (PostgreSQL), where the map rules get translated to the join conditions for the INNER JOIN at the bottom:

...
"SELECT All PK/FK constraints in the datasource"
UNION
SELECT
NULL AS relationship_name
, 'map-rule' AS relationship_type
, f_from.datasource_name AS from_datasource_name
, f_from.dataset_group_name AS from_dataset_group_name
, f_from.dataset_name AS from_dataset_name
, f_from.field_name AS from_field_name
, f_from.dataset_type_name AS from_dataset_type_name
, f_from.datatype_name AS from_datatype_name
, f_to.datasource_name AS to_datasource_name
, f_to.dataset_group_name AS to_dataset_group_name
, f_to.dataset_name AS to_dataset_name
, f_to.field_name AS to_field_name
, f_to.dataset_type_name AS to_dataset_type_name
, f_to.datatype_name AS to_datatype_name
, now() AS relationship_source_created
FROM fieldsCTE f_from
INNER JOIN fieldsCTE f_to ON (
(regexp_match("f_from"."field_name",'(.*__)?(.*)_DWKEY','i'))[2] = (regexp_match("f_to"."dataset_name",'D_(.*)','i'))[1]
AND "f_from"."dataset_name" <> "f_to"."dataset_name"
AND "f_from"."dataset_group_name" = "f_to"."dataset_group_name"
)
...

Valid concatenators:

  • and
  • or

All rules in the “mapRules” array are enclosed in parentheses and concatenated with the selected concatenator. It is possible to nest concatenators.

The following placeholders are available in the “field” property of the “left” and “right” parts of a map rule:

  • from_datasource_name
  • from_dataset_group_name
  • from_dataset_name
  • from_field_name
  • from_dataset_type_name
  • from_datatype_name
  • to_datasource_name
  • to_dataset_group_name
  • to_dataset_name
  • to_field_name
  • to_dataset_type_name
  • to_datatype_name

Valid operators and the corresponding SQL map:

  • equal: “=”
  • not_equal: “<>”
  • greater: “>”
  • less: “<”
  • greater_equal: “>=”
  • less_equal: “<=”
  • “=”: “=”
  • “<>”: “<>”
  • “>”: “>”
  • “<”: “<”
  • “>=”: “>=”
  • “<=”: “<=”
  • in: “IN”
  • not_in: “NOT IN”
  • is: “IS”
  • is_not: “IS NOT”
  • like: “LIKE”
  • not_like: “NOT LIKE”
  • ilike: “ILIKE”
  • not_ilike: “NOT ILIKE”

It is possible to provide flags and use capture groups in the regular expression like so:

{
"relationshipMap": [
{
"concatenator": "and",
"mapRules": [
{
"left": {
"regexp": "(.*__)?(.*)_DWKEY",
"captureGroup": 2,
"flags": "i",
"field": "from_field_name"
},
"operator": "equal",
"right": {
"regexp": "D_(.*)",
"field": "to_dataset_name"
}
}
]
}
]
}

If no flags or capture group are provided, these are the defaults:

"captureGroup": 1
"flags": "i"

It is possible to include a filter clause along with the relationship map rules to limit them to certain parts of the datasource. Example:

{
"contextFilters": [
{
"concatenator": "and",
"filters": [
{
"field": "from_dataset_group_name",
"operator": "not_equal",
"value": "MY_SCHEMA"
},
{
"field": "to_dataset_group_name",
"operator": "not_equal",
"value": "MY_SCHEMA"
}
]
}
],
"relationshipMap": [
...
]
}

The example above results in this SQL (PostgreSQL):

...
SELECT *
FROM (
"SELECT All PK/FK constraints in the datasource"
UNION
"SELECT All relationships generated by map rules"
)
WHERE ( from_dataset_group_name <> 'my_schema'
AND to_dataset_group_name <> 'my_schema' )
...

The context filter syntax is exactly the same as for the datasource filters, with these exceptions:

  • Valid fields to filter on are the same as the fields available to use in the relationship map rules.
  • Context filter values are case-insensitive.

To create a datasource sync task:

  1. Create a Task

  2. Select the task type Datasource Sync

  3. Select the database that should be synced in the list.

  4. Optionally, specify one or more email addresses that will receive an email notification if the sync detects schema changes. This feature requires email notifications to be correctly configured under Settings -> Notifications.

  5. Click Save

  6. To run the datasource sync task, select the task in the list under Tasks and click “Run”.

  7. Open the task to see a job log, and click a job to see more details like job steps and a changelog.

This section explains the sync logic used when syncing datasources.

Dataset Groups in the source system and Katalogue are matched on their dataset_group_name. The match is case-sensitive. When synchronizing Katalogue with the source system, if there is no matching Dataset Group in Katalogue, it will be added.

If there is a Dataset Group in the same Datasource in Katalogue but not in the source system, the Dataset Group and its related Datasets and Fields will be deleted from Katalogue (hard delete).

If there is a matching Dataset Group in Katalogue, the following Dataset Group attributes in Katalogue will be overwritten:

  • dataset_group_name
  • dataset_group_source_description
  • dataset_group_source_id
  • dataset_group_source_modified

Note that the dataset_group_description will not be updated; that attribute is only synchronized when the Dataset Group is added to Katalogue for the first time.

Datasets in the source system and Katalogue are matched on their dataset_name. The match is case-sensitive. When synchronizing Katalogue with the source system, if there is no matching Dataset in Katalogue, it will be added.

If there is a Dataset in the same Dataset Group in Katalogue but not in the source system, the Dataset and its related Fields will be deleted from Katalogue (hard delete).

If there is a matching Dataset in Katalogue, the following Dataset attributes in Katalogue will be overwritten:

  • dataset_name
  • dataset_source_id
  • dataset_source_description
  • dataset_type_id (i.e. if e.g. a database view is converted to a database table with the same name, it will be correctly updated)
  • dataset_group_id (i.e. if e.g. a database schema changes name or if a dataset is moved to another schema, it will be correctly updated)
  • dataset_source_modified
  • dataset_technical_definition

Note that the dataset_description will not be updated; that attribute is only synchronized when the Dataset is added to Katalogue for the first time.

Fields in the source system and Katalogue are matched on their field_name. The match is case-sensitive. When synchronizing Katalogue with the source system, if there is no matching Field in Katalogue, it will be added. When a Field is added to Katalogue for the first time, it will be associated with an existing Field Description if either of the following criteria is fulfilled:

  • There is exactly one Field Description in Katalogue with a name matching* the Field. If there are multiple Field Descriptions with a matching name, no association will be made.
  • There are one or more existing Fields in Katalogue with exactly the same name as the Field in the source system that are already associated with exactly one Field Description. The name of that Field Description does not have to be the same as the Field name. If there are existing Fields with the same name that are associated with different Field Descriptions, no association will be made.

If the Field that is to be added to Katalogue for the first time has a description/comment in the source system and matches an existing Field Description, that Field Description will be used and the description/comment of the Field will be discarded. If there is no matching Field Description, a new Field Description will be created under some circumstances; see the Field Descriptions section below.

* The match between Field names and Field Description names is done after converting to lowercase, removing whitespace and removing underscores that are not at the beginning of the name. E.g. a Field or Field Description named “_Customer full_NAME” will be converted to “_customerfullname” in the matching.
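
As a hedged illustration (not Katalogue’s actual implementation), this normalization can be expressed in PostgreSQL like so:

-- keep leading underscores, drop all other underscores and whitespace, lowercase
SELECT lower(
  (regexp_match(name, '^(_*)'))[1]
  || regexp_replace(regexp_replace(name, '^_*', ''), '[\s_]', '', 'g')
) AS normalized_name
FROM (VALUES ('_Customer full_NAME')) AS t(name)
-- returns '_customerfullname'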

If there is a Field in the same Dataset in Katalogue but not in the source system, the Field will be deleted from Katalogue (hard delete).

If there is a matching Field in Katalogue, the following Field attributes in Katalogue will be overwritten:

  • datatype_id (i.e. if the datatype in the source system is changed, it will be correctly updated)
  • is_primary_key (i.e. if the primary key status in the source system is changed, it will be correctly updated)

Note that the field_description_id (i.e. the association with Field Descriptions) will not be updated; that attribute is only synchronized when the Field is added to Katalogue for the first time.

When a Field is added to Katalogue for the first time and it has a description/comment in the source system, a Field Description will be created (and associated with any new Fields with the same name in the source system) if all of the following criteria are fulfilled:

  • There is no existing Field Description that already matches (see the Fields section above)
  • There are no other Fields in Katalogue with the same name that are already associated with an existing Field Description

When a Field Description is automatically added, these attributes are set by the logic described:

  • field_description_name:
    • If the first letter is an underscore, use the field_name without any changes. Otherwise:
    • Replace underscores with whitespace
    • Split CamelCase names with whitespace
    • Make the first letter of each word uppercase and the rest lowercase
    • Example: “_FieldName” stays “_FieldName”, whilst both “FieldName” and “field_NAME” become “Field Name” (see the sketch after this list)
  • field_role_name:
    • “Measure” if the field’s datatype_category is “number” and the name does not contain the strings “id”, “key” or “code”. Otherwise “Dimension”.
  • field_sensitivity:
    • Will always be the field sensitivity that has the is_default flag in the field_sensitivity table. By default, this is the “Internal” field sensitivity.
  • is_pii:
    • Will always be “false”
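
The field_description_name naming rules can be illustrated in SQL. Below is a hedged PostgreSQL sketch of that logic; it demonstrates the rules described above and is not Katalogue’s actual code.

SELECT CASE
  -- names starting with an underscore are kept as-is
  WHEN field_name LIKE '\_%' THEN field_name
  -- otherwise: underscores -> spaces, split CamelCase, then capitalize each word
  ELSE initcap(
    regexp_replace(replace(field_name, '_', ' '), '([a-z])([A-Z])', '\1 \2', 'g')
  )
END AS field_description_name
FROM (VALUES ('_FieldName'), ('FieldName'), ('field_NAME')) AS t(field_name)
-- returns '_FieldName', 'Field Name', 'Field Name'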