ETLs move data from a source to a destination. The filter strategy defines how the ETL process identifies new rows in the source database: a designated field on the destination table is compared against the source, and only rows that are new based on that field are pulled over. There are three options:
- SelectAllFilterStrategy: Get all data rows from the source, applying no filter.
- ModifiedSinceFilterStrategy: Select the latest changes to the source table, using a date/timestamp column (specified by timestampColumnName) to identify records created or updated since the last ETL job run. Rows changed since the last run will be selected. This is the most commonly used filter strategy.
- RunFilterStrategy: Check a specified column, typically an increasing integer column (e.g. a Run ID), against a given or stored value; for instance, only rows with a higher value than at the last ETL run are transformed. Often used for relational data, multi-staged transfer pipelines, or when an earlier upstream process writes a batch of parent-child data to the source. Useful when child records must be transferred together with their parent records.
For example, the strategy below checks for updated data by consulting the "Date" field.
<incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Date" />
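A RunFilterStrategy is configured with the same incrementalFilter element. The sketch below assumes a run table keyed by an integer Run ID; the attribute and table names shown (runTableSchema, runTable, pkColumnName, fkColumnName, and the "patient"/"Transfer" identifiers) are illustrative and should be checked against your server's ETL schema reference.

```xml
<!-- Hypothetical example: select rows whose TransferId exceeds the last processed run -->
<incrementalFilter className="RunFilterStrategy"
                   runTableSchema="patient" runTable="Transfer"
                   pkColumnName="RowId" fkColumnName="TransferId" />
```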
Incremental Deletion of Target Rows
When incrementally deleting rows based on a selective filter strategy, use the deletedRowsSource element to track the filtered values for deletion independently of the main query. Even if there are no new rows in the source query, any new records in the "deletedRowsSource" will still be found and deleted from the target. With this method, the non-deleted rows keep their row ids, maintaining any links to other objects in the target table.
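As a sketch, the deletedRowsSource element nests inside the incrementalFilter. The schema, query, and key column names below ("etlsource", "deletedRecords", "RecordId") are hypothetical placeholders; substitute the query that records deletions in your source system.

```xml
<!-- Hypothetical example: track deletions via a separate "deletedRecords" query -->
<incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Date">
    <deletedRowsSource schemaName="etlsource" queryName="deletedRecords"
                       deletedSourceKeyColumnName="RecordId"
                       targetKeyColumnName="RecordId" />
</incrementalFilter>
```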
Target Options
When the data is loaded into the destination database, there are three options for handling cases where the source query returns key values that already exist in the destination:
- Append: The default value. Appends new rows to the end of the existing table. Fails on duplicate primary key values.
- Merge: Merges data into the destination table. Matches primary key values to determine insert or update. Target tables must have a primary key. Not supported for tables with auto-incrementing integer keys.
- Truncate: Deletes the contents of the destination table before inserting the selected data.
<destination schemaName="vehicle" queryName="targetQuery" targetOption="merge" />
An ETL process can load data to a file, such as a comma-separated values (CSV) file, instead of a database table. For example, the following configuration element directs output to a tab-separated file named "report.tsv". The rowDelimiter and columnDelimiter attributes are optional; if omitted, a standard TSV file is produced.
<destination type="file" dir="etlOut" fileBaseName="report" fileExtension="tsv" />
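To produce a comma-separated file instead, the delimiters can be set explicitly. The delimiter values below are illustrative assumptions about how the attributes are typically used, not a definitive reference.

```xml
<!-- Hypothetical example: write etlOut/report.csv with explicit delimiters -->
<destination type="file" dir="etlOut" fileBaseName="report" fileExtension="csv"
             columnDelimiter="," rowDelimiter="&#10;" />
```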