Datasource Syncing

Even though you can add datasets, fields, etc. manually in Katalogue, it is highly recommended to use the datasource sync functionality to automatically extract and sync metadata from your datasources. To set this up, you must first create a connection in Katalogue and configure it with connection credentials (see supported connectors in the Connectors section). Then, you must create a datasource sync task for the datasource.

This feature is supported for the following Connection Types:

  • Databricks
  • IBM DB2
  • Microsoft SQL Server
  • Oracle
  • PostgreSQL
  • Snowflake

By default, Katalogue uses pre-defined SQL queries for each connector when connecting to a relational database. It is possible to override these default queries with user-defined custom queries. Some use cases for this are:

  • Provide custom filters instead of using Katalogue’s datasource filters feature described below.
  • Use custom logic to retrieve e.g. primary key definitions. This is useful in scenarios where actual primary key constraints are missing but the primary keys can be derived from unique indexes, naming conventions, etc.
  • You do not want to, or cannot, connect to the actual source system directly, and therefore use the datasource as an intermediary.

Katalogue uses three import queries to get all the metadata it needs:

  1. Base Import Query: imports columns, datasets, schemas and related metadata such as column datatypes, primary keys, etc.
  2. Relationship Import Query: imports relationships (PK/FK constraints) between tables.
  3. Lineage Import Query: imports data lineage between tables.

Define a custom query like so:

  1. Create a Connection and select a Connection Type that supports custom import queries.
  2. Go to the “ADVANCED” tab and enable “Use Custom Base Import Query”.
  3. Click the “Edit Base Import Query” button to open a dialog where you can enter a custom SQL SELECT statement. See the requirements below.
  4. The default import query for the Connection Type can be retrieved by clicking the “Insert Default Query” button in the dialog.
  5. Click “Save” to save the query and close the dialog.
  6. Optionally, repeat the steps above for the “Use Custom Relationship Import Query” and “Use Custom Lineage Import Query” options to provide custom SQL SELECT statements for PK/FK constraints and dataset lineage.

Note that SELECT statements are the only supported SQL statements. Katalogue validates and sanitizes the query before execution to prevent SQL injection.

The custom base import query must return columns with the exact names (case-sensitive) listed below. The mandatory columns must be present and cannot be null or an empty string; they are the minimum data Katalogue needs to correctly import data.

The columns suffixed with “_source_id” are strongly recommended. Including them enables the datasource sync to handle renames: assets are primarily matched by name and, if no match is found, by source id. If there is no source id, Katalogue performs a delete and insert instead, which means that associations with Field Descriptions, custom descriptions, etc. are lost.

| Column Name | Datatype | Description | Is Mandatory |
| --- | --- | --- | --- |
| datasource_name | string | Database name | Yes |
| dataset_group_name | string | Schema name | Yes |
| dataset_name | string | Name of the table/view | Yes |
| dataset_type_name | string | Type of dataset. Must be either “table” or “view”. | Yes |
| field_name | string | Column name | Yes |
| datatype_name | string | Datatype name. Should be the raw name without appended precision and scale. | Yes |
| datatype_length | integer | Character length | |
| datatype_precision | integer | Numeric precision | |
| datatype_scale | integer | Numeric scale | |
| datasource_type_name | string | Database type. If present, this overrides the datasource type derived from the connector, i.e. it is possible to set this to “Microsoft SQL Server” even though the connector is to an Oracle database. The datasource will then show up as a “Microsoft SQL Server” datasource in the GUI instead of an Oracle datasource. | |
| datasource_source_id | string | Unique, internal id of the database in the source database. If present, Katalogue uses this to handle renames better by doing an update instead of a delete & insert. | |
| dataset_group_source_id | string | Unique, internal id of the schema in the source database. If present, Katalogue uses this to handle renames better by doing an update instead of a delete & insert. | |
| dataset_source_id | string | Unique, internal id of the table/view in the source database. If present, Katalogue uses this to handle renames better by doing an update instead of a delete & insert. | |
| field_source_id | string | Unique, internal id of the column in the source database. If present, Katalogue uses this to handle renames better by doing an update instead of a delete & insert. | |
| dataset_row_count | bigint | Number of rows in the dataset. Normally taken from a statistics table (much faster than a “select count(*)“). | |
| dataset_size_bytes | bigint | Total disk size in bytes of the dataset. | |
| is_primary_key | boolean | True if the column is part of a primary key constraint | |
| is_nullable | boolean | True if the column can be NULL | |
| ordinal_position | integer | Ordinal position of the column | |
| default_value | string | Default value of the column | |
| dataset_url | string | URL of the dataset. This will show up as a clickable link in the GUI. | |
| datasource_source_created | timestamp | Timestamp for when the database was created | |
| dataset_group_source_created | timestamp | Timestamp for when the schema was created | |
| dataset_source_created | timestamp | Timestamp for when the table/view was created | |
| datasource_source_modified | timestamp | Timestamp for when the database was last altered | |
| dataset_group_source_modified | timestamp | Timestamp for when the schema was last altered | |
| dataset_source_modified | timestamp | Timestamp for when the table/view was last altered | |
| dataset_technical_definition | string | Normally the SQL statement for views, but can be any transformation logic used to populate a table or view. | |
| dataset_technical_definition_raw | string | Can be used if the SQL statement/transformation logic is dynamically created, e.g. for dbt datasources: this column then contains the raw, uncompiled dbt code, while “dataset_technical_definition” contains the compiled SQL. | |
| datasource_description | string | Comment or description of the database | |
| dataset_group_description | string | Comment or description of the schema | |
| dataset_description | string | Comment or description of the table/view | |
| field_description | string | Comment or description of the column | |

Example query:

SELECT
"datasource_name"
, "dataset_group_name"
, "dataset_name"
, "dataset_type_name"
, "field_name"
, "datatype_name"
, "datatype_length"
, "datatype_precision"
, "datatype_scale"
, "datasource_type_name"
, "datasource_source_id"
, "dataset_group_source_id"
, "dataset_source_id"
, "field_source_id"
, "dataset_row_count"
, "dataset_size_bytes"
, "is_primary_key"
, "is_nullable"
, "ordinal_position"
, "default_value"
, "dataset_url"
, "datasource_source_created"
, "dataset_group_source_created"
, "dataset_source_created"
, "datasource_source_modified"
, "dataset_group_source_modified"
, "dataset_source_modified"
, "dataset_technical_definition"
, "dataset_technical_definition_raw"
, "datasource_description"
, "dataset_group_description"
, "dataset_description"
, "field_description"
FROM (
-- information schema, system tables etc.
)
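
For illustration, a minimal custom base import query for a PostgreSQL datasource could be built on the standard information_schema views as sketched below. This is a hand-written example, not Katalogue’s actual default query; it returns the six mandatory columns plus a few optional ones and leaves out e.g. primary key detection.

SELECT
c.table_catalog AS "datasource_name"
, c.table_schema AS "dataset_group_name"
, c.table_name AS "dataset_name"
-- map the information_schema table type onto the two values Katalogue accepts
, CASE WHEN t.table_type = 'VIEW' THEN 'view' ELSE 'table' END AS "dataset_type_name"
, c.column_name AS "field_name"
, c.data_type AS "datatype_name"
, c.character_maximum_length AS "datatype_length"
, c.numeric_precision AS "datatype_precision"
, c.numeric_scale AS "datatype_scale"
, (c.is_nullable = 'YES') AS "is_nullable"
, c.ordinal_position AS "ordinal_position"
, c.column_default AS "default_value"
FROM information_schema.columns c
JOIN information_schema.tables t
ON t.table_schema = c.table_schema
AND t.table_name = c.table_name
-- leave out the system schemas
WHERE c.table_schema NOT IN ('pg_catalog', 'information_schema')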

The custom relationship import query must return columns with these exact names (case-sensitive). The mandatory columns must be present and cannot be null or an empty string; they are the minimum data Katalogue needs to correctly import data.

| Column Name | Datatype | Description | Is Mandatory |
| --- | --- | --- | --- |
| relationship_name | string | Name of the FK constraint | Yes |
| relationship_ordinal_position | integer | Ordinal position of the column in the FK constraint. Should be 1 even if there is only one column in the constraint. | Yes |
| from_datasource_name | string | Name of the database with the FK constraint | Yes |
| from_dataset_group_name | string | Name of the schema with the FK constraint | Yes |
| from_dataset_name | string | Name of the table with the FK constraint | Yes |
| from_field_name | string | Name of the column in the FK constraint | Yes |
| to_datasource_name | string | Name of the database with the target/referenced column of the FK constraint | Yes |
| to_dataset_group_name | string | Name of the schema with the target/referenced column of the FK constraint | Yes |
| to_dataset_name | string | Name of the table with the target/referenced column of the FK constraint | Yes |
| to_field_name | string | Name of the target/referenced column of the FK constraint | Yes |
| relationship_source_created | timestamp | Timestamp for when the FK constraint was created | |

Example query:

SELECT
"relationship_name"
, "relationship_ordinal_position"
, "from_datasource_name"
, "from_dataset_group_name"
, "from_dataset_name"
, "from_field_name"
, "to_datasource_name"
, "to_dataset_group_name"
, "to_dataset_name"
, "to_field_name"
, "relationship_source_created"
FROM (
-- information schema, system tables etc.
)
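
For illustration, a custom relationship import query for a PostgreSQL datasource could be sketched from the standard information_schema views as below. This is not Katalogue’s actual default query. Note that information_schema.constraint_column_usage does not reliably pair columns for multi-column foreign keys, so a production query would typically join the pg_catalog tables instead; relationship_source_created is omitted since information_schema does not expose it.

SELECT
tc.constraint_name AS "relationship_name"
, kcu.ordinal_position AS "relationship_ordinal_position"
-- the "from" side: the table that carries the FK constraint
, kcu.table_catalog AS "from_datasource_name"
, kcu.table_schema AS "from_dataset_group_name"
, kcu.table_name AS "from_dataset_name"
, kcu.column_name AS "from_field_name"
-- the "to" side: the referenced table/column
, ccu.table_catalog AS "to_datasource_name"
, ccu.table_schema AS "to_dataset_group_name"
, ccu.table_name AS "to_dataset_name"
, ccu.column_name AS "to_field_name"
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON kcu.constraint_schema = tc.constraint_schema
AND kcu.constraint_name = tc.constraint_name
JOIN information_schema.constraint_column_usage ccu
ON ccu.constraint_schema = tc.constraint_schema
AND ccu.constraint_name = tc.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY'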

The custom lineage import query must return columns with these exact names (case-sensitive). The mandatory columns must be present and cannot be null or an empty string; they are the minimum data Katalogue needs to correctly import data.

| Column Name | Datatype | Description | Is Mandatory |
| --- | --- | --- | --- |
| lineage_name | string | Name of the procedure/job/transformation/… that moves the data | |
| lineage_type | string | Type that describes what kind of procedure/job/transformation/… moves the data | |
| from_datasource_name | string | Name of the upstream database, i.e. the source of the procedure/job/transformation/… | Yes |
| from_dataset_group_name | string | Name of the upstream schema, i.e. the source of the procedure/job/transformation/… | Yes |
| from_dataset_name | string | Name of the upstream table, i.e. the source of the procedure/job/transformation/… | Yes |
| to_datasource_name | string | Name of the downstream database, i.e. the target of the procedure/job/transformation/… | Yes |
| to_dataset_group_name | string | Name of the downstream schema, i.e. the target of the procedure/job/transformation/… | Yes |
| to_dataset_name | string | Name of the downstream table, i.e. the target of the procedure/job/transformation/… | Yes |

Example query:

SELECT
"lineage_name"
, "lineage_type"
, "from_datasource_name"
, "from_dataset_group_name"
, "from_dataset_name"
, "to_datasource_name"
, "to_dataset_group_name"
, "to_dataset_name"
FROM (
-- information schema, system tables etc.
)
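
Lineage is rarely available in standard system views, so a custom lineage query typically reads from whatever metadata your ETL tooling maintains. A hedged sketch is shown below, where etl.job_table_mapping is a hypothetical table invented for this example:

SELECT
-- hypothetical ETL metadata table; substitute your own tooling's tables
j.job_name AS "lineage_name"
, 'etl-job' AS "lineage_type"
, j.source_database AS "from_datasource_name"
, j.source_schema AS "from_dataset_group_name"
, j.source_table AS "from_dataset_name"
, j.target_database AS "to_datasource_name"
, j.target_schema AS "to_dataset_group_name"
, j.target_table AS "to_dataset_name"
FROM etl.job_table_mapping j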

Datasource filters are a feature to limit the data extracted from the source system. If no filters are provided, all tables, fields, etc. found in the INFORMATION_SCHEMA (or similar views) are imported, with the exception of some common tables such as the INFORMATION_SCHEMA views themselves.

Datasource filters are applied in the extract query from the datasource. Hence, data that is filtered out with the datasource filter never leaves the datasource.

The datasource filters are expressed as a JSON object with the following pattern:

{
"datasource_filter": [
{
"concatenator": "and",
"filters": [
{
"field": "dataset_name",
"operator": "not_equal",
"value": "field"
},
{
"field": "dataset_group_name",
"operator": "in",
"value": ["public", "stage"]
}
]
}
]
}

For a PostgreSQL datasource, the example above translates to the following SQL:

...
WHERE ( c.table_name <> 'field'
AND c.table_schema = ANY('{"public","stage"}') )

Valid concatenators:

  • and
  • or

All filters in the “filters” array are enclosed in parentheses and concatenated with the selected concatenator. It is possible to nest concatenators like so:

{
"datasource_filter": [
{
"concatenator": "and",
"filters": [
{
"field": "dataset_name",
"operator": "not_equal",
"value": "field"
},
{
"concatenator": "or",
"filters": [
{
"field": "dataset_group_name",
"operator": "equal",
"value": "public"
},
{
"field": "dataset_group_name",
"operator": "equal",
"value": "stage"
}
]
}
]
}
]
}

For a Postgres datasource, the example above results in the following SQL:

...
WHERE ( c.table_name <> 'field'
AND ( c.table_schema = 'public'
OR c.table_schema = 'stage' ) )

Valid fields to filter on:

  • datasource_name
  • dataset_group_name
  • dataset_name
  • dataset_type_name
  • field_name
  • datatype_name
  • datasource_source_id
  • dataset_group_source_id
  • dataset_source_id
  • field_source_id
  • dataset_technical_definition

Valid operators and the corresponding SQL map:

  • equal: “=”
  • not_equal: “<>”
  • greater: “>”
  • less: “<”
  • greater_equal: “>=”
  • less_equal: “<=”
  • “=”: “=”
  • “<>”: “<>”
  • “>”: “>”
  • “<”: “<”
  • “>=”: “>=”
  • “<=”: “<=”
  • in: “IN”
  • not_in: “NOT IN”
  • is: “IS”
  • is_not: “IS NOT”
  • like: “LIKE”
  • not_like: “NOT LIKE”
  • ilike: “ILIKE”
  • not_ilike: “NOT ILIKE”

ILIKE and NOT ILIKE are translated to valid SQL in databases that do not support these operators natively, e.g. Oracle and SQL Server. LIKE, NOT LIKE, ILIKE and NOT ILIKE also work with an array as value for Postgres and Snowflake datasources; this is currently not supported for any other datasources. Wildcard characters for any of the LIKE operators must be included in the “value” part of the filter clause like so:

{
"datasource_filter": [
{
"concatenator": "and",
"filters": [
{
"field": "dataset_name",
"operator": "ilike",
"value": ["field", "dataset%"]
}
]
}
]
}

For a Snowflake datasource, the example above results in the following SQL:

...
WHERE ( c.TABLE_NAME ILIKE ANY ('field', 'dataset%') )
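
PostgreSQL supports the same construct natively via ILIKE ANY over an array. As a hedged illustration (not verified Katalogue output), the translated SQL for a Postgres datasource could look like:

...
-- ILIKE ANY matches if the column matches at least one pattern in the array
WHERE ( c.table_name ILIKE ANY (ARRAY['field', 'dataset%']) )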

Datasource filters with custom import query

Datasource filters are supported in the same way as stated above when using your own custom import query.

Katalogue automatically imports primary key/foreign key constraints when they are present in the datasource and presents them visually. With the relationship mapping feature, Katalogue can also create such relationships between datasets that do not have PK/FK constraints in the datasource. This feature can therefore be used to show relationships where such constraints are not supported, such as between database views, or between datasets that simply lack them. Showing such relationships can be very helpful for analysts and others trying to understand e.g. a star schema modelled with views.

Relationship map rules are applied in the import query from the Katalogue stage.raw_relationship table to the public.relationship table. Hence, Katalogue always imports all relationships from the datasource to stage and then applies the map rules when moving data from the stage table. If you do not want some relationships to leave the datasource, you must use your own custom relationship import query to filter them out.

Katalogue categorizes all relationships according to how they were created. This is a simple label and is not tied to any functionality. The relationship type can be viewed by going to a Field -> Relationship tab.

There are currently three types of relationships:

  • physical: pk/fk constraint imported from the datasource
  • map-rule: relationship created with the relationship map rule feature
  • virtual: relationship created by inheritance through a Field Description. If a Field without a relationship is associated with a Field Description that in turn is associated with another Field with a relationship, Katalogue will in some cases automatically create (i.e. “guess”) a relationship for the first Field.

The feature to let Katalogue automatically create virtual relationships can be disabled for the entire datasource by checking the “Disable Virtual Relationships” checkbox in the datasource sync settings.

The relationship map rules are expressed as a JSON object with the following pattern:

{
"relationshipMap": [
{
"concatenator": "and",
"mapRules": [
{
"left": {
"regexp": "(.*__)?(.*)_DWKEY",
"captureGroup": 2,
"field": "from_field_name"
},
"operator": "equal",
"right": {
"regexp": "D_(.*)",
"field": "to_dataset_name"
}
},
{
"left": {
"regexp": "(.*__)?(.*)",
"captureGroup": 2,
"field": "from_field_name"
},
"operator": "equal",
"right": {
"field": "to_field_name"
}
},
{
"left": {
"field": "from_dataset_name"
},
"operator": "not_equal",
"right": {
"field": "to_dataset_name"
}
},
{
"left": {
"field": "from_dataset_group_name"
},
"operator": "equal",
"right": {
"field": "to_dataset_group_name"
}
}
]
}
]
}

The example above contains four map rules with the following meaning:

  1. Look at all field names that may contain a prefix ending with __ and that have the suffix _DWKEY, and match them with dataset names that have the prefix D_. Example: fields named CUSTOMER_DWKEY or, for example, PREVIOUS__CUSTOMER_DWKEY in any table will match all fields in dataset D_CUSTOMER.
  2. Only match field names with the same name. Example: field CUSTOMER_DWKEY in any table will only match field CUSTOMER_DWKEY.
  3. Do not match fields in the same dataset. This prevents self-join relationships. Example: field CUSTOMER_DWKEY in table D_CUSTOMER will not relate to itself.
  4. Only match fields and datasets within the same dataset group. Example: create relationships within schema MART_CUSTOMER and within schema MART_FINANCE, but no relationships between them.

As these four rules are combined with “and” logic, the result is: “Create a relationship from field CUSTOMER_DWKEY in any table to field CUSTOMER_DWKEY in table D_CUSTOMER, as long as the tables are in the same schema.”

The example above results in the following SQL (PostgreSQL), where the map rules get translated to the join conditions for the INNER JOIN at the bottom:

...
"SELECT All PK/FK constraints in the datasource"
UNION
SELECT
NULL AS relationship_name
, 'map-rule' AS relationship_type
, f_from.datasource_name AS from_datasource_name
, f_from.dataset_group_name AS from_dataset_group_name
, f_from.dataset_name AS from_dataset_name
, f_from.field_name AS from_field_name
, f_from.dataset_type_name AS from_dataset_type_name
, f_from.datatype_name AS from_datatype_name
, f_to.datasource_name AS to_datasource_name
, f_to.dataset_group_name AS to_dataset_group_name
, f_to.dataset_name AS to_dataset_name
, f_to.field_name AS to_field_name
, f_to.dataset_type_name AS to_dataset_type_name
, f_to.datatype_name AS to_datatype_name
, now() AS relationship_source_created
FROM fieldsCTE f_from
INNER JOIN fieldsCTE f_to ON (
(regexp_match("f_from"."field_name",'(.*__)?(.*)_DWKEY','i'))[2] = (regexp_match("f_to"."dataset_name",'D_(.*)','i'))[1]
AND "f_from"."dataset_name" <> "f_to"."dataset_name"
AND "f_from"."dataset_group_name" = "f_to"."dataset_group_name"
)
...

Valid concatenators:

  • and
  • or

All rules in the “mapRules” array are enclosed in parentheses and concatenated with the selected concatenator. It is possible to nest concatenators.

The following placeholders are available in the “field” property of the “left” and “right” parts of a map rule:

  • from_datasource_name
  • from_dataset_group_name
  • from_dataset_name
  • from_field_name
  • from_dataset_type_name
  • from_datatype_name
  • to_datasource_name
  • to_dataset_group_name
  • to_dataset_name
  • to_field_name
  • to_dataset_type_name
  • to_datatype_name

Valid operators and the corresponding SQL map:

  • equal: “=”
  • not_equal: “<>”
  • greater: “>”
  • less: “<”
  • greater_equal: “>=”
  • less_equal: “<=”
  • “=”: “=”
  • “<>”: “<>”
  • “>”: “>”
  • “<”: “<”
  • “>=”: “>=”
  • “<=”: “<=”
  • in: “IN”
  • not_in: “NOT IN”
  • is: “IS”
  • is_not: “IS NOT”
  • like: “LIKE”
  • not_like: “NOT LIKE”
  • ilike: “ILIKE”
  • not_ilike: “NOT ILIKE”

It is possible to provide flags and use capture groups in the regular expression like so:

{
"relationshipMap": [
{
"concatenator": "and",
"mapRules": [
{
"left": {
"regexp": "(.*__)?(.*)_DWKEY",
"captureGroup": 2,
"flags": "i",
"field": "from_field_name"
},
"operator": "equal",
"right": {
"regexp": "D_(.*)",
"field": "to_dataset_name"
}
}
]
}
]
}

If no flags or capture group are provided, these are the defaults:

"captureGroup": 1
"flags": "i"

It is possible to include a filter clause along with the relationship map rules to limit them to certain parts of the datasource. Example:

{
"contextFilters": [
{
"concatenator": "and",
"filters": [
{
"field": "from_dataset_group_name",
"operator": "not_equal",
"value": "MY_SCHEMA"
},
{
"field": "to_dataset_group_name",
"operator": "not_equal",
"value": "MY_SCHEMA"
}
]
}
],
"relationshipMap": [
...
]
}

The example above results in this SQL (PostgreSQL):

...
SELECT *
FROM (
"SELECT All PK/FK constraints in the datasource"
UNION
"SELECT All relationships generated by map rules"
)
WHERE ( from_dataset_group_name <> 'my_schema'
AND to_dataset_group_name <> 'my_schema' )
...

The context filter syntax is exactly the same as for the datasource filters, with these exceptions:

  • Valid fields to filter on are the same as the fields available to use in the relationship map rules.
  • Context filter values are case-insensitive.

To create a datasource sync task:

  1. Create a Task

  2. Select the task type Datasource Sync

  3. Select the database that should be synced in the list.

  4. Optionally, specify one or more email addresses that will receive an email notification if the sync detects schema changes. This feature requires email notifications to be correctly configured under Settings -> Notifications.

  5. Click Save

  6. To run the datasource sync task, select the task in the list under Tasks and click “Run”.

  7. Open the task to see a job log, and click a job to see more details like job steps and a changelog.

This section explains the sync logic used when syncing datasources.

Dataset Groups in the source system and Katalogue are matched on their dataset_group_name. The match is case-sensitive. When synchronizing Katalogue with the source system, if there is no matching Dataset Group in Katalogue, it will be added.

If there is a Dataset Group in the same Datasource in Katalogue but not in the source system, the Dataset Group and its related Datasets and Fields will be deleted from Katalogue (hard delete).

If there is a matching Dataset Group in Katalogue, the following Dataset Group attributes in Katalogue will be overwritten:

  • dataset_group_name
  • dataset_group_source_description
  • dataset_group_source_id
  • dataset_group_source_modified

Note that the dataset_group_description will not be updated; that attribute is only synchronized when the Dataset Group is added to Katalogue for the first time.

Datasets in the source system and Katalogue are matched on their dataset_name. The match is case-sensitive. When synchronizing Katalogue with the source system, if there is no matching Dataset in Katalogue, it will be added.

If there is a Dataset in the same Dataset Group in Katalogue but not in the source system, the Dataset and its related Fields will be deleted from Katalogue (hard delete).

If there is a matching Dataset in Katalogue, the following Dataset attributes in Katalogue will be overwritten:

  • dataset_name
  • dataset_source_id
  • dataset_source_description
  • dataset_type_id (i.e. if e.g. a database view is converted to a database table with the same name, it will be correctly updated)
  • dataset_group_id (i.e. if e.g. a database schema changes name or if a dataset is moved to another schema, it will be correctly updated)
  • dataset_source_modified
  • dataset_technical_definition

Note that the dataset_description will not be updated; that attribute is only synchronized when the Dataset is added to Katalogue for the first time.

Fields in the source system and Katalogue are matched on their field_name. The match is case-sensitive. When synchronizing Katalogue with the source system, if there is no matching Field in Katalogue, it will be added. When a Field is added to Katalogue for the first time, it will be associated with an existing Field Description if either of the following criteria is fulfilled:

  • There is exactly one Field Description in Katalogue with a name matching* the Field. If there are multiple Field Descriptions with a matching name, no association will be made.
  • There are one or more existing Fields in Katalogue with exactly the same name as the Field in the source system that are already associated with exactly one Field Description. The name of that Field Description does not have to be the same as the Field name. If there are existing Fields with the same name that are associated with different Field Descriptions, no association will be made.

If the Field that is to be added to Katalogue for the first time has a description/comment in the source system and matches an existing Field Description, that Field Description will be used and the description/comment of the Field will be discarded. If there is no matching Field Description, a new Field Description will be created under some circumstances; see the Field Descriptions section below.

* The match between Field names and Field Description names is done after converting to lowercase, removing whitespace and removing underscores that are not at the beginning of the name. E.g. a Field or Field Description named “_Customer full_NAME” will be converted to “_customerfullname” in the matching.
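
As a hedged illustration (not Katalogue’s actual implementation), this normalization can be expressed in PostgreSQL like so:

-- keep leading underscores, drop all other underscores and whitespace, lowercase
SELECT lower(
  (regexp_match(name, '^(_*)'))[1]
  || regexp_replace(regexp_replace(name, '^_*', ''), '[\s_]', '', 'g')
) AS normalized_name
FROM (VALUES ('_Customer full_NAME')) AS t(name)
-- returns '_customerfullname'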

If there is a Field in the same Dataset in Katalogue but not in the source system, the Field will be deleted from Katalogue (hard delete).

If there is a matching Field in Katalogue, the following Field attributes in Katalogue will be overwritten:

  • datatype_id (i.e. if the datatype in the source system is changed, it will be correctly updated)
  • is_primary_key (i.e. if the primary key status in the source system is changed, it will be correctly updated)

Note that the field_description_id (i.e. the association with Field Descriptions) will not be updated; that attribute is only synchronized when the Field is added to Katalogue for the first time.

When a Field is added to Katalogue for the first time and it has a description/comment in the source system, a Field Description will be created (and associated with any new Fields with the same name in the source system) if all of the following criteria are fulfilled:

  • There is no existing Field Description that already matches (see the Fields section above)
  • There are no other Fields in Katalogue with the same name that are already associated with an existing Field Description

When a Field Description is automatically added, these attributes are set by the logic described:

  • field_description_name:
    • If the first letter is an underscore, use the field_name without any changes. Otherwise:
    • Replace underscores with whitespace
    • Split CamelCase names with whitespace
    • Make the first letter of each word uppercase and the rest lowercase
    • Example: “_FieldName” stays “_FieldName”, whilst both “FieldName” and “field_NAME” become “Field Name” (see the sketch after this list)
  • field_role_name:
    • “Measure” if the field’s datatype_category is “number” and the name does not contain the strings “id”, “key” or “code”. Otherwise “Dimension”.
  • field_sensitivity:
    • Will always be the field sensitivity that has the is_default flag in the field_sensitivity table. By default, this is the “Internal” field sensitivity.
  • is_pii:
    • Will always be “false”
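
The field_description_name naming rules can be illustrated in SQL. Below is a hedged PostgreSQL sketch of that logic; it demonstrates the rules described above and is not Katalogue’s actual code.

SELECT CASE
  -- names starting with an underscore are kept as-is
  WHEN field_name LIKE '\_%' THEN field_name
  -- otherwise: underscores -> spaces, split CamelCase, then capitalize each word
  ELSE initcap(
    regexp_replace(replace(field_name, '_', ' '), '([a-z])([A-Z])', '\1 \2', 'g')
  )
END AS field_description_name
FROM (VALUES ('_FieldName'), ('FieldName'), ('field_NAME')) AS t(field_name)
-- returns '_FieldName', 'Field Name', 'Field Name'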