Premium Feature — Available with all Premium Editions of LabKey Server. Learn more or contact LabKey.

The heart of an ETL is the <transforms> element, which contains one or more individual <transform> elements. Each of these specifies a <source> and <destination>.

This topic covers options available for the destination or target of an ETL. The destination may be either a query/table in a database, or a file. The parameters and options available for each option are covered in this topic.

Destination Element

Within the <transforms> section, each <transform> included in your ETL will have a <source> and a <destination>, which can be either a query or a file.

Query Target

The default type of target is a query destination, i.e. a target in a database. Parameters include:

  • type = "query" Default, so need not be included.
  • schemaName: The schema where you will find the query in queryName. It may be local, external, or linked. In the below example, "etltest".
  • queryName: The table located in that schema. In this example, "endTable"
  • It looks like:
<transform id="step1" type="org.labkey.di.pipeline.TransformTask">
<description>Copy from source to target</description>
<source schemaName="etltest" queryName="source" />
<destination schemaName="etltest" queryName="endTable" />
</transform>

Learn about more options for database targets below.

File Target

An ETL process can load data to a file, such as a tab separated file (TSV), instead of loading data into a database. For a file destination, the syntax includes:

  • type = "file"
  • dir
  • fileBaseName
  • fileExtension
For example, the following ETL configuration element directs outputs to a tab separated file named "report.tsv".
<destination type="file" dir="etlOut" fileBaseName="report-${TransformRunId}-${Timestamp}" fileExtension="tsv" />

Optional parameters include:

  • rowDelimiter and columnDelimiter: if omitted you get a standard TSV file.
Special substitutions
  • ${TransformRunId} and ${Timestamp} are optional special tokens for the fileBaseName to substitute those values into the output file name of each run.

Target Options for Table Destinations

Target Operation - Append, Truncate, or Merge

When the destination is a destination database table, there are additional parameters available, including targetOption, with three options for handling cases when the source query returns key values that already exist in the destination. You cannot target a LabKey SQL query.

  • Append: (Default) Appends new rows to the end of the existing table. The ETL will fail if it encounters duplicate primary key values.
  • Truncate: Deletes the contents of the destination table before inserting the source data, if any.
  • Merge: Merges data into the destination table.
    • Matches primary key values to determine insert or update per row.
    • Target tables must have a primary key.
    • Not supported for tables with auto-incrementing or system-managed keys, as these guarantee import row uniqueness.
    • Not fully supported for insert into tables that also use a trigger script. Learn more below.
For example, the following snippet will add new rows to the "vehicle.endTable" for uniquely keyed rows in the "sourceSchema.sourceTable", and also update the data in any rows with keys already present in the destination query.

...
<transform id="mergeStep">
<description>Copy to target</description>
<source schemaName="sourceSchema" queryName="sourceTable"/>
<destination schemaName="vehicle" queryName="endTable" targetOption="merge" />
</transform>
...

If you used the "append" target option instead, this ETL step would fail if any rows in "sourceSchema.sourceTable" were already in the "vehicle.endTable" (even if they had not changed). If your goal is to append new rows, but not fail because of rows previously transferred by an earlier run of this ETL, you would use an incremental filter strategy to only act on the new rows.

A series of usage examples for these target options (and incremental filter strategies) is available in the topic:

Alternate Keys for Merging Data

When the source and target tables use different primary keys (for example, the target table uses an auto-incrementing integer value), you can use a different column to match incoming rows for merge operations. More than one column can be used to match if needed, similar to a multi-column primary key.

For example, a study dataset can be configured to allow different levels of cardinality. Those components included (participant, timepoint, and/or other column) are composited into a single LSID column. This example tells the server to not use the LSID column, and instead to match against the ObjectId column.

<destination schemaName="study" queryName="myDataset" targetOption="merge">
<alternateKeys>
<column name="objectid"/>
</alternateKeys>
</destination>

Bulk Loading

LabKey Server provides audit logging for many types of changes, including ETL data transfer. When LabKey Server is not the system of record, this may be overhead that does not provide any value. You can set bulk loading to be true to bypass the logging. By default, it is set to false.

<destination schemaName="vehicle" queryName="endTable" bulkLoad="true" />

ETL Context for Trigger Scripts

Trigger scripts offer a way for a module to validate or transform data when it is manipulated. Sometimes when data is being transferred via ETL it is useful to adjust the trigger behavior. For example, you might relax validation during an ETL because it is legacy data that cannot be updated in its source system to reflect the desired validation for newly entered data. The ETL framework uses a setting on the trigger script's extraContext object to convey that it is running as part of an ETL.

// Do general validation
if (extraContext.dataSource !== 'etl') {
// Do strict validation
}

Using Merge with Trigger Scripts

When there is a trigger script on the target table, regardless of whether your ETL uses the extraContext object described above, the merge target option for the ETL is not supported. Instead, an ETL "merge" will behave more like a "replace" for the row where the key matches an existing row.

In a situation where your target table has many columns of data and you have an ETL updating rows but only providing a subset of the columns, that "subset" row will be passed as a complete row via the trigger script and fill in the missing columns as null. This means that if your source query does not include every column expected in the target table, you could lose data through the row effectively being "replaced" by the incoming ETL row.

One possible workaround is to LEFT JOIN the columns in the destination table to the source query. This will provide the missing column data for any rows being updated and prevent the previously existing destination data from being set to null.

Map Column Names

If the columns in your source and destination will use the same column names for the same data, they will automatically be matched. If not, you can specify an explicit mapping as part of the <destination> element. Learn more in this topic: ETL: Column Mapping.

Related Topics

Was this content helpful?

Log in or register an account to provide feedback


previousnext
 
expand allcollapse all