Premium Feature — Available with all Premium Editions of LabKey Server. Learn more or contact LabKey.

Transact Multi-Step ETL

A multi-step ETL can be wrapped in a single transaction on the source and/or destination database side, ensuring that the entire process will proceed on a consistent set of source data, and that the entire operation can be rolled back if any error occurs. This option should be considered only if:

  • Every step in the ETL uses the same source and/or destination scope. Individual steps may use different schemas within these sources.
  • Your desired operation cannot be supported using "modified" timestamps or "transfer id" columns, which are the preferred methods for ensuring data consistency across steps.
Note that transacting the source database is only supported for Postgres. SQL Server does not support the required SNAPSHOT isolation on REPEATABLE READ required to avoid locking issues.

To enable multi-step transactions, use the the “transactSourceSchema” and “transactDestinationSchema” attributes on the top level “etl” element in an ETL xml:

<etl xmlns="http://labkey.org/etl/xml" transactSourceSchema="SOURCE_SCHEMA_NAME" transactDestinationSchema="DESTINATION_SCHEMA_NAME">

The specified schema names can be any schema in the datasource of which LabKey is aware, i.e. any schema in the LabKey datasource, or any external schema mapped in an external datasource. Schema names are used instead of datasource names because the schema names are known to the etl author, whereas datasource names can be arbitrary and set at server setup time.

Disable Transactions

ETL steps are, by default, run as transactions. To turn off transactions when running an ETL process, set useTransaction to false on the destination, as shown below:

<destination schemaName="study" queryName="demographics" useTransaction="false" />

Note that disabling transactions risks leaving the destination or target table in an intermediate state if an error occurs during ETL processing.

Batch Process Transactions

By default an single ETL job will be run as a single transaction, no matter how many rows are processed. You can change the default behavior by specifying that a new transaction be committed for every given number of rows processed. In the example below, a new transaction will be committed for every 500 rows processed:

<destination schemaName="study" queryName="demographics" bulkLoad="true" batchSize="500" />

Note that batch processing transactions risks leaving the destination or target table in an intermediate state if an error occurs during ETL processing.

Discussion

Was this content helpful?

Log in or register an account to provide feedback


previousnext
 
expand all collapse all