Premium Feature — Available in the Professional, Professional Plus, and Enterprise Editions. Learn more or contact LabKey.

Multiple Steps in a Single ETL or Multiple ETLs?

  • Do changes to the source affect multiple target datasets at once? If so, consider configuring multiple steps in one ETL definition (see the sketch following this list).
  • Do source changes impact a single target dataset? Consider using multiple ETL definitions, one for each dataset.
  • Are the target queries related to one another? Consider multiple steps in one ETL definition.
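
For illustration, here is a minimal sketch of a two-step ETL definition. The schema and query names (etltest, sourceQuery1, targetDataset1, etc.) are placeholders for this example, not part of any shipped module:

<etl xmlns="http://labkey.org/etl/xml">
  <name>MultiStepExample</name>
  <description>Copies two related source queries to their targets in one job, in order.</description>
  <transforms>
    <!-- Step 1: runs first -->
    <transform id="step1">
      <source schemaName="etltest" queryName="sourceQuery1"/>
      <destination schemaName="study" queryName="targetDataset1" targetOption="merge"/>
    </transform>
    <!-- Step 2: runs only after step 1 completes -->
    <transform id="step2">
      <source schemaName="etltest" queryName="sourceQuery2"/>
      <destination schemaName="study" queryName="targetDataset2" targetOption="merge"/>
    </transform>
  </transforms>
  <schedule><poll interval="1h"/></schedule>
</etl>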

Configuration Options

The following configuration options are offered for customizing ETL processes:

Schedules

You can set a polling schedule to check the source database for new data and automatically run the ETL process when new data is found. Either specify a time interval, or use a full cron expression to schedule ETLs. When choosing a schedule for running ETLs, consider the timing of other processes, like automated backups, which could cause conflicts with running your ETL.

The schedule below checks every hour for new data:

<schedule><poll interval="1h" /></schedule>

These examples show cron expressions that schedule the job:

<!-- run at 10:15 every day -->
<schedule><cron expression="0 15 10 ? * *"/></schedule>

<!-- run every hour on the hour every day, i.e. 9:00, 10:00, etc. -->
<schedule><cron expression="0 0 * ? * *"/></schedule>

<!-- run on Tuesdays and Thursdays at 3:30 pm -->
<schedule><cron expression="0 30 15 ? * TUE,THU *"/></schedule>

Cron expressions consist of six or seven space-separated fields: seconds, minutes, hours, day-of-month, month, day-of-week, and an optional year, in that order. The wildcard '*' indicates every valid value. The character '?' is used in the day-of-month or day-of-week field to mean 'no specific value', i.e., when the other field is used to define the days the job runs.

It is good practice to include a comment stating in plain language what your cron expression means.

A builder for the Quartz cron format can help you construct these expressions; one is available at https://www.freeformatter.com/cron-expression-generator-quartz.html.

A full description of the cron syntax is available on the Quartz site.

Note: If checking the source database involves a long-running query, or many ETLs are scheduled for the same time, there can be a delay between the scheduled time and the time the ETL job is placed in the pipeline queue while the database responds. As a result, closely scheduled ETLs may execute in an order inconsistent with their chronological schedule.

If ETLs must run in a particular order, it is recommended to define them as multiple steps within a single ETL (as in the example above) to ensure the order of execution.

Target Options

When the data is loaded into the destination database, there are three options for handling cases when the source query returns key values that already exist in the destination:

  • Append: Appends new rows to the end of the existing table. Fails on duplicate primary key values.
  • Merge: Merges data into the destination table. Matches primary key values to determine insert or update. Target tables must have a primary key.
  • Truncate: Deletes the contents of the destination table before inserting the selected data.

For example:

<destination schemaName="vehicle" queryName="targetQuery" targetOption="merge" />

Note: Merge and truncate are only supported for datasets, not lists.

Filter Strategy

Filter strategies define how the ETL process identifies new rows in the source database to be pulled over to the target.

For details see ETL: Filter Strategies.
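
As a brief illustration, the element below selects only rows modified since the last successful run. This is a sketch; the column name "modified" is an assumption about the source query:

<incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="modified" />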

File Targets

An ETL process can load data to a file, such as a comma-separated values (CSV) file, instead of loading it into a database table. For example, the following ETL configuration element directs output to a tab-separated file named "report.tsv". The rowDelimiter and columnDelimiter attributes are optional; if omitted, you get a standard TSV file.

<destination type="file" dir="etlOut" fileBaseName="report" fileExtension="tsv" />

Transaction Options

Transact Multi-Step ETL

A multi-step ETL can be wrapped in a single transaction on the source and/or destination database side, ensuring that the entire process will proceed on a consistent set of source data, and that the entire operation can be rolled back if any error occurs. This option should be considered only if:

  • Every step in the ETL uses the same source and/or destination scope. Individual steps may use different schemas within these sources.
  • Your desired operation cannot be supported using "modified" timestamps or "transfer id" columns, which are the preferred methods for ensuring data consistency across steps.

Note that transacting the source database is only supported for Postgres. SQL Server does not support the SNAPSHOT isolation on REPEATABLE READ that is required to avoid locking issues.

To enable multi-step transactions, use the "transactSourceSchema" and "transactDestinationSchema" attributes on the top-level "etl" element of an ETL xml:

<etl xmlns="http://labkey.org/etl/xml" transactSourceSchema="SOURCE_SCHEMA_NAME" transactDestinationSchema="DESTINATION_SCHEMA_NAME">

The specified schema names can be any schema LabKey is aware of in the data source: any schema in the LabKey data source, or any external schema mapped from an external data source. Schema names are used instead of data source names because schema names are known to the ETL author, whereas data source names can be arbitrary and are set at server setup time.

Disable Transactions

Note that disabling transactions risks leaving the target table in an intermediate state if an error occurs during ETL processing.

ETL steps are, by default, run as transactions. To turn off transactions when running an ETL process, set useTransaction to false on the destination, as shown below:

<destination schemaName="study" queryName="demographics" useTransaction="false" />

Batch Process Transactions

Note that committing transactions in batches risks leaving the target table in an intermediate state if an error occurs partway through ETL processing.

By default, a single ETL job runs as a single transaction, no matter how many rows are processed. You can change this behavior by specifying that a new transaction be committed for every given number of rows. In the example below, a new transaction is committed for every 500 rows processed:

<destination schemaName="study" queryName="demographics" bulkLoad="true" batchSize="500" />

Command Tasks

Once a command task has been registered in a pipeline task XML file, you can specify the task as an ETL step:

<transform id="ProcessingEngine" type="ExternalPipelineTask" 
externalTaskId="org.labkey.api.pipeline.cmd.CommandTask:myEngineCommand"/>

See this example module for an ETL that calls a pipeline job: ETLPipelineTest.zip

Permission to Run

ETL processes are run in the context of a folder. If run manually, they run with the permissions of the initiating user. If scheduled, they will run with the permissions of a "service user" which can be configured by the folder administrator.
