ETL processes, which Extract/Transform/Load
data, are run in the context of a folder. If run manually, they run with the permissions of the initiating user. If scheduled, they will run with the permissions of a "service user" which can be configured by the folder administrator.
The following configuration options are offered for customizing ETL processes:
You can set a polling schedule to check the source database for new data and automatically run the ETL process when new data is found. The schedule below checks every hour for new data:
<schedule><poll interval="1h" /></schedule>
Another automatic scheduling option is to set a time each day to run the job.
<!-- run at 10:15 every day -->
<schedule><cron expression="0 15 10 ? * *"/></schedule>
<!-- run at 3:30am every day -->
<schedule><cron expression="0 30 3 * * ?"/></schedule>
Use an cron expression builder for the Quartz cron format, for example: http://www.cronmaker.com/
See quartz documentation for more examples
When the data is loaded into the destination database, there are three options for handling cases when the source query returns key values that already exist in the destination:
- Append: Appends new rows to the end of the existing table. Fails on duplicate primary key values.
- Merge: Merges data into the destination table. Matches primary key values to determine insert or update. Target tables must have a primary key.
- Truncate: Deletes the contents of the destination table before inserting the selected data.
<destination schemaName="vehicle" queryName="targetQuery" targetOption="merge" />
Note: Merge and truncate are only supported for datasets, not lists.
The filter strategy, defined in the incrementalFilter tag, is how the ETL process identifies new rows in the source database. The strategy allows a special value on the destination table to be compared to the source and only pulls over new rows based on that value. Using an incrementalFilter allows you to use the append option to add new rows to your target table and not accidentally run into any duplicate record conflicts. There are three options:
- SelectAllFilterStrategy: Apply no further filter to the source; simply transform/transfer all rows returned by the query.
- RunFilterStrategy: Check a specified column, typically an increasing integer column (e.g. Run ID), against a given or stored value. For instance, any rows with a higher value than when the ETL process was last run are transformed.
- ModifiedSinceFilterStrategy: Use a specified date/timestamp column (timeStampColumnName) to identify the new records. Rows changed since the last run will be transformed.
For example, the strategy below says to check for updated data by consulting the "Date" field.
<incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Date" />
Incremental Deletion of Target Rows
When incrementally deleting rows based on a selective filter strategy, use the element deletedRowsSource
to correctly track the filtered values for deletion independently of the main query. Even if there are no new rows in the source query, any new records in the "deleteRowsSource" will still be found and deleted from the source. Using this method, the non-deleted rows will keep their row ids, maintaining any links to other objects in the target table.
An ETL process can load data to a file, such as a comma separated file (CSV), instead of loading data into a database table. For example, the following ETL configuration element directs outputs to a tab separated file named "report.tsv". rowDelimiter and columnDelimiter are optional, if omitted you get a standard TSV file.
<destination type="file" dir="etlOut" fileBaseName="report" fileExtension="tsv" />
Transact Multi-Step ETL
A multi-step ETL can be wrapped in a single transaction on the source and/or destination database side, ensuring that the entire process will proceed on a consistent set of source data, and that the entire operation can be rolled back if any error occurs. This option should be considered only if
- Every step in the ETL uses the same source and/or destination scope. Individual steps may use different schemas within these sources.
- Your desired operation cannot be supported using "modified" timestamps or "transfer id" columns, which are the preferred methods for ensuring data consistency across steps.
Note that transacting the source database is only supported for Postgres. SQL Server does not support the required SNAPSHOT isolation on REPEATABLE READ required to avoid locking issues.
To enable multi-step transactions, use the the “transactSourceSchema” and “transactDestinationSchema” attributes on the top level “etl” element in an ETL xml:
<etl xmlns="http://labkey.org/etl/xml" transactSourceSchema="SOURCE_SCHEMA_NAME" transactDestinationSchema="DESTINATION_SCHEMA_NAME">
The specified schema names can be any schema in the datasource of which LabKey is aware, i.e. any schema in the LabKey datasource, or any external schema mapped in an external datasource. Schema names are used instead of datasource names because the schema names are known to the etl author, whereas datasource names can be arbitrary and set at server setup time.
Note that disabling transactions risks leaving the destination or target table in an intermediate state if an error occurs during ETL processing.
ETL steps are, by default, run as transactions. To turn off transactions when running an ETL process, set useTransaction to false on the destination, as shown below:
<destination schemaName="study" queryName="demographics" useTransaction="false" />
Batch Process Transactions
Note that batch processing transactions risks leaving the destination or target table in an intermediate state if an error occurs during ETL processing.
By default an single ETL job will be run as a single transaction, no matter how many rows are processed. You can change the default behavior by specifying that a new transaction be committed for every given number of rows processed. In the example below, a new transaction will be committed for every 500 rows processed:
<destination schemaName="study" queryName="demographics" bulkLoad="true" batchSize="500" />
Once a command task has been registered in a pipeline task xml file
, you can specify the task as an ETL step.
<transform id="ProcessingEngine" type="ExternalPipelineTask"
See this example module for an ETL that calls a pipeline job: ETLPipelineTest.zip