Premium Feature — Available with all Premium Editions of LabKey Server. Learn more or contact LabKey.

This topic covers the strategies you can use to define which rows in the source your ETL should act upon, or reflect in the destination query:

  • <sourceFilters>: An individual transform action may only need to act upon a filtered subset of rows.
  • <incrementalFilter>: The ETL as a whole should only act on the rows that have changed since the last time it was run.
These two approaches may be used separately or together to achieve various goals.

Source Filtering

The most common way to filter or shape data is to write a query in LabKey against the source, or to create a view on the external server hosting the source. However, if you are using a remote server where filtering the data is not easily possible, or you only need minimal filtering, the ETL itself can apply filters on any of the source table's columns to pull only a subset of rows. These <sourceFilters> are in addition to any filters configured for the incremental filtering described below.

Source filters are applied on a per-transform basis in the ETL's XML definition. As with most other filter definitions in LabKey Server, if multiple filters are defined, they are ANDed together. This example filters the source table's demographics information to only include women who are 20 years or older:

<transform id="step1forEditTest" type="org.labkey.di.pipeline.TransformTask">
    <source queryName="demographics" schemaName="etlsource">
        <sourceFilters>
            <sourceFilter column="sex" operator="eq" value="female" />
            <sourceFilter column="age" operator="gte" value="20" />
        </sourceFilters>
    </source>
    <destination queryName="demographics" schemaName="study" targetOption="merge" />
</transform>

Note that these filters are not used for the deleted rows query, if present.

Incremental Filter Strategy Options

ETLs move data from a source to a destination. An incremental filter strategy is used when the ETL should track changes between runs, i.e. when subsequent ETL runs should avoid reprocessing rows from previous runs. There are three incremental filtering strategy options:

ModifiedSinceFilterStrategy

Select rows that were created or modified in the source table since the last time this ETL was run. You might choose this option for performance reasons, or to apply business logic (such as in a trigger) that expects rows to be imported only when changed. The goal is to have your ETL act only on "new" information.

To accomplish the incremental filtering, the ETL framework saves a timestamp of the last time this ETL ran. That value will be used to query the source only for those rows that have been modified (or created) since then. The source column you name in the timestampColumnName property will be used to select rows. This column must be of type "Date Time" and is required for this type of filter.

For example, the strategy below says to check for updated data by checking the "Modified" field on the row in the source query against the last time the ETL was run. The first time you run the ETL, all rows will "match".

...
<incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Modified" />
...

Note that if your intention is to use an ETL with a ModifiedSinceFilterStrategy to keep two sets of data "in sync", you also need to consider rows deleted in the source query: they will no longer exist to "match" and be acted upon. See Incremental Deletion of Rows below.

RunFilterStrategy

The Run filter strategy runs the ETL against rows related to a specified run. You identify the column to check, typically an increasing integer column (e.g. Run ID), and the ETL compares it against a given or stored value; for instance, any rows with a higher value than when the ETL process was last run are acted upon. This strategy is often used for relational data, multi-staged transfer pipelines, or cases where an earlier upstream process writes a batch of parent-child data to the source. It is useful when child records must be transferred at the same time as the parent records, as in the sketch below.
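For illustration, a minimal sketch of a run-based incremental filter. The schema name "patient", the run table "Transfer", and the key column names here are hypothetical placeholders; substitute your own run table and its primary/foreign key columns:

...
<incrementalFilter className="RunFilterStrategy"
    runTableSchema="patient" runTable="Transfer"
    pkColumnName="Rowid" fkColumnName="TransferRun" />
...

In this sketch, source rows whose TransferRun value is higher than the last run processed by the ETL are selected, so parent and child records tied to the same run transfer together.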

SelectAllFilterStrategy

Select all data rows from the source, applying no filter. This is the default when no incremental filter is specified.
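If you want to declare this default explicitly in the ETL XML, the element takes no additional attributes; a minimal sketch:

...
<incrementalFilter className="SelectAllFilterStrategy" />
...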

Incremental Deletion of Rows

When an ETL runs, the source data is queried in accordance with the ETL filter strategy and the results are inserted or updated per the destination target option. However, rows deleted from the source will not show up in the results of the ETL query for new/updated rows, as they no longer exist in the source, and thus the deletions will not be propagated to the destination data. There are two options for representing those deletions in the destination table.

For ETLs that write to a file as the destination, incremental deletion is not an option, as the file is not edited upon subsequent runs.

Use Truncate

If you have a small, quick-running ETL and use a "truncate" target option, then every time the ETL runs, all rows in the destination will be deleted and replaced with the "remaining" undeleted rows from the source. While this effectively accomplishes incremental deletion, it may not be practical for large queries and complex ETLs. A sketch follows.
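For example, reusing the demographics tables from the source filtering example above (the transform id here is a hypothetical placeholder), a truncate-style transform could look like:

<transform id="truncateDemographics" type="org.labkey.di.pipeline.TransformTask">
    <source queryName="demographics" schemaName="etlsource" />
    <destination queryName="demographics" schemaName="study" targetOption="truncate" />
</transform>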

Use deletedRowsSource

When you include a <deletedRowsSource> element, you can identify another table (or view) on the source side that keeps a record of deleted rows. This table should contain a key value from the source table for each deleted row, as well as a timestamp indicating when it was deleted. After an ETL with an incremental filter imports any updated/created rows, it will next query the deletedRowsSource for any timestamps later than the last time the ETL ran, and delete any rows from the target table that are returned by this query. It matches these rows using the keys identified in the deletedRowsSource properties deletedSourceKeyColumnName and targetKeyColumnName. Most commonly, audit tables, or views created from audit tables, are used as the deletedRowsSource.

The <deletedRowsSource> element is part of the <incrementalFilter> section. For example, here we have a "ModifiedSince" filter that will also check the query "projects_deletedRowsSource" for rows to delete:

...
<incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="modified">
    <deletedRowsSource schemaName="etltest" queryName="projects_deletedRowsSource"
        deletedSourceKeyColumnName="project" timestampColumnName="modified"
        targetKeyColumnName="project" />
</incrementalFilter>
...

deletedRowsSource allows you to specify both the deletedSourceKeyColumnName and targetKeyColumnName for use in matching rows for deletion from the target.

Even if there are no new (or recently modified) rows in the source query, any new records in the "deletedRowsSource" will still be tracked and deleted from the target.

Note that using <deletedRowsSource> with an <incrementalFilter> strategy does not support a remote connection.
