ETL: Examples

2024-03-28

Premium Feature — Available with all Premium Editions of LabKey Server. Learn more or contact LabKey.

This page contains samples of how ETL syntax can be used to accomplish various actions.

Target Options: Append/Truncate/Merge

Append (Default)

The default target option is to append new data to the existing destination, whether empty or partially populated. Note that this behavior occurs when there is no targetOption parameter provided as well.

In this example, new data from "etl_source" will be appended to the end of the "etl_target" table. If the primary key of any rows in "etl_source" match rows already present in "etl_target", this operation will fail.

<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
<name>Append</name>
<description>append rows from etl_source to etl_target</description>
<transforms>
<transform id="step1">
<description>Copy to target</description>
<source schemaName="externalSource" queryName="etl_source"/>
<destination schemaName="patient" queryName="etl_target" />
</transform>
</transforms>
</etl>

If you want to append new data, but ignore existing rows in the target, you have two choices:

Append with Two Targets

This example shows how you could populate two target tables from one source table, both using "append". When steps are run in sequence like this, by default the same source would be "applied" to each destination. If needed, you could also add behavior like applying a given stored procedure on the source during step 1 before the second append occurred in step 2 so that the resulting contents were different.

<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
<name>Append</name>
<description>append rows from etl_source to etl_target and etl_target2</description>
<transforms>
<transform id="step1">
<description>Copy to etl_target</description>
<source schemaName="externalSource" queryName="etl_source" />
<destination schemaName="patient" queryName="etl_target" />
</transform>
<transform id="step2">
<description>Copy to etl_target2 table</description>
<source schemaName="externalSource" queryName="etl_source" />
<destination schemaName="patient" queryName="etl_target2" />
</transform>
</transforms>
<incrementalFilter className="SelectAllFilterStrategy"/>
</etl>

Truncate the Target Query

The following truncates the destination table, without copying any data from a source query. Note the lack of a <source> element.

...
<transform id="trunc">
<description>Truncate etl_target Table</description>
<destination schemaName="patient" queryName="etl_target" targetOption="truncate"/>
</transform>
...

Truncate and Replace

If you specify a source table and use "truncate" on the target table, you will truncate the target and append data from the source table.

...
<transform id="step1">
<description>Clear etl_target and replace with (append) rows from etl_source.</description>
<source schemaName="patient" queryName="etl_source" />
<destination schemaName="patient" queryName="etl_target" targetOption="truncate"/>
</transform>
...

This offers a basic way to keep two tables "in sync". For large queries or complex operations, you may prefer to use an incremental filtering to only "act on" changes in the source since the last time the ETL was run.

Merge

Using the merge target option will both update data for existing rows and append new rows.

...
<transform id="merge">
<description>Merge to target.</description>
<source schemaName="externalSource" queryName="etl_source" />
<destination schemaName="patient" queryName="etl_target" targetOption="merge"/>
</transform>
...

Note that the target option "merge" by itself will operate on the entire source table. In situations where the source table is large, or where you want to apply business logic and need only "newly changed" rows to be considered, use an incremental merge strategy as illustrated below.

Merge with Alternate Key

Ordinarily the primary key is used for merges. If this is not suitable for some reason (would cause duplicates or orphaned data), you can specify an alternate key to use instead. This section shows only the <destination> portion of the ETL.

...
<destination schemaName="vehicle" queryName="etl_target2" targetOption="merge">
<alternateKeys>
<!-- The pk of the target table is the "rowId" column. Use "OtherId" as an alternate match key -->
<column name="OtherId"/>
</alternateKeys>
</destination>
...

Incremental Merge: based on Date

If you want to merge only rows that have changed (or been added) since the last time this ETL was run, combine use of a "merge" target option with use of the "ModifiedSince" filter strategy. This example says to select rows from the "etl_source" table that have changed since the last time this ETL was run, then merge them into the etl_target table.

<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
<name>Merge</name>
<description>Merge rows from etl_source to etl_target.</description>
<transforms>
<transform id="merge">
<description>Merge to target.</description>
<source schemaName="externalSource" queryName="etl_source" />
<destination schemaName="patient" queryName="etl_target" targetOption="merge"/>
</transform>
</transforms>
<incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="modified" />
</etl>

Incremental Merge: by Run ID

If you only want to merge rows from a specific run ID, and not the entire table, you can use merge with the "Run based" filter strategy.

<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
<name>MergeByRunId</name>
<description>Merge rows from a specific run id to the target table.</description>
<transforms>
<transform id="step1">
<description>Copy to target</description>
<source schemaName="externalSource" queryName="etl_source" />
<destination schemaName="patient" queryName="etl_target" />
</transform>
</transforms>
<incrementalFilter className="RunFilterStrategy" runTableSchema="patientRuns"
runTable="Transfer" pkColumnName="Rowid" fkColumnName="TransformRun" />

</etl>

Pass Parameters to a SQL Query

The following ETL process passes parameters (MinTemp=99 and MinWeight=150) into its source query; useful when the source query is a parameterized query) that will take some action using the parameters.

<?xml version="1.0" encoding="UTF-8" ?>
<etl xmlns="http://labkey.org/etl/xml">
<name>PatientsToTreated</name>
<description>Transfers from the Patients table to the Treated table.</description>
<transforms>
<transform id="step1">
<description>Patients to Treated Table</description>
<source schemaName="study" queryName="ParameterizedPatients" />
<destination schemaName="study" queryName="Treated"/>
</transform>
</transforms>
<parameters>
<parameter name="MinTemp" value="99" type="DECIMAL" />
<parameter name="MinWeight" value="150" type="DECIMAL" />
</parameters>
</etl>

Java ColumnTransforms

Java developers can add a Java class, which extends the org.labkey.api.di.columnTransform.ColumnTransform abstract class, to handle the transformation step of an ETL process. Within the <destination> element, you can point the "column to be transformed" to the Java class as follows:

...
<columnTransforms>
<column source="columnToTransform" transformClass="org.labkey.di.columnTransforms.MyJavaClass"/>
</columnTransforms>
...

The Java class receives the values of the column one row at a time. The Java class can validate, transform or perform some other action on these values. What is returned from the doTransform method of this class is what gets inserted into the target table. See below for an example implementation.

The ETL source below uses the Java class org.labkey.di.columnTransforms.TestColumnTransform to apply changes to data in the "name" column.

ETL.xml

<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
<name>Append Transformed Column</name>
<description>Append rows from etl_source to etl_target, applying column transformation using a Java class.</description>
<transforms>
<transform id="step1" type="org.labkey.di.pipeline.TransformTask">
<description>Copy to target</description>
<source schemaName="vehicle" queryName="etl_source" />
<destination schemaName="vehicle" queryName="etl_target">
<columnTransforms>
<column source="name" transformClass="org.labkey.di.columnTransforms.TestColumnTransform"/>
</columnTransforms>
<constants>
<column name="myConstant" type="varchar" value="aConstantValue"/>
</constants>
</destination>
</transform>
</transforms>
<incrementalFilter className="SelectAllFilterStrategy" />
</etl>

The Java class below is used by the ETL process to apply transformations to the supplied column, in this case the "name" column.

TestColumnTransform.java

package org.labkey.di.columnTransforms;

import org.labkey.api.di.columnTransform.AbstractColumnTransform;

/**
* An example of Java implementing a transform step.
* Prepends the value of the "id" column of the source query
* to the value of the source column specified in the ETL configuration xml,
* then appends the value of the "myConstant" constant set in the xml.
*/
public class TestColumnTransform extends ColumnTransform
{
@Override
protected Object doTransform(Object inputValue)
{
Object prefix = getInputValue("id");
String prefixStr = null == prefix ? "" : prefix.toString();
return prefixStr + "_" + inputValue + "_" + getConstant("myConstant");
}
}

Related Topics