ETL processes, which Extract/Transform/Load
data, are run in the context of a folder. If run manually, they run with the permissions of the initiating user. If scheduled, they will run with the permissions of a "service user" which can be configured by the folder administrator.
The following configuration options are offered for customizing ETL processes:
You can set a polling schedule to check the source database for new data and automatically run the ETL process when new data is found. The schedule below checks every hour for new data:
<schedule><poll interval="1h" /></schedule>
Another automatic scheduling option is to set a time each day to run the job.
<!-- run at 10:15 every day -->
<schedule><cron expression="0 15 10 ? * *"/></schedule>
<!-- run at 3:30am every day -->
<schedule><cron expression="0 30 3 * * ?"/></schedule>
Use an cron expression builder for the Quartz cron format, for example: http://www.cronmaker.com/
See quartz documentation for more examples
When the data is loaded into the destination database, there are three options for handling cases when the source query returns key values that already exist in the destination:
- Append: Appends new rows to the end of the existing table. Fails on duplicate primary key values.
- Merge: Merges data into the destination table. Matches primary key values to determine insert or update. Target tables must have a primary key.
- Truncate: Deletes the contents of the destination table before inserting the selected data.
<destination schemaName="vehicle" queryName="targetQuery" targetOption="merge" />
Note: Merge and truncate are only supported for datasets, not lists.
The filter strategy, defined in the incrementalFilter tag, is how the ETL process identifies new rows in the source database. The strategy allows a special value on the destination table to be compared to the source and only pulls over new rows based on that value. Using an incrementalFilter allows you to use the append option to add new rows to your target table and not accidentally run into any duplicate record conflicts. There are three options:
- SelectAllFilterStrategy: Apply no further filter to the source; simply transform/transfer all rows returned by the query.
- RunFilterStrategy: Check a specified column, typically an increasing integer column (e.g. Run ID), against a given or stored value. For instance, any rows with a higher value than when the ETL process was last run are transformed.
- ModifiedSinceFilterStrategy: Use a specified date/timestamp column (timeStampColumnName) to identify the new records. Rows changed since the last run will be transformed.
For example, the strategy below says to check for updated data by consulting the "Date" field.
<incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Date" />
Incremental Deletion of Target Rows
When incrementally deleting rows based on a selective filter strategy, use the element deletedRowsSource
to correctly track the filtered values for deletion independently of the main query. Even if there are no new rows in the source query, any new records in the "deleteRowsSource" will still be found and deleted from the source. Using this method, the non-deleted rows will keep their row ids, maintaining any links to other objects in the target table.
An ETL process can load data to a file, such as a comma separated file (CSV), instead of loading data into a database table. For example, the following ETL configuration element directs outputs to a tab separated file named "report.tsv". rowDelimiter and columnDelimiter are optional, if omitted you get a standard TSV file.
<destination type="file" dir="etlOut" fileBaseName="report" fileExtension="tsv" />
Note that these settings risk leaving the destination or target table in an intermediate state if an error occurs during ETL processing.
ETL jobs are, by default, run as transactions. To turn off transactions when running an ETL process, set useTransaction to false on the destination, as shown below:
<destination schemaName="study" queryName="demographics" useTransaction="false" />
By default an single ETL job will be run as a single transaction, no matter how many rows are processed. You can change the default behavior by specifying that a new transaction be committed for every given number of rows processed. In the example below, a new transaction will be committed for every 500 rows processed:
<destination schemaName="study" queryName="demographics" bulkLoad="true" batchSize="500" />
Once a command task has been registered in a pipeline task xml file
, you can specify the task as an ETL step.
<transform id="ProcessingEngine" type="ExternalPipelineTask"