Handling deletes on ETL where the source is a LK dataset?

LabKey Support Forum
Ben Bimber  2018-04-11 13:28
Status: Closed
 

The ETL framework allows a RemoteQuery source, such as a dataset on another server. If the source table is large and relatively stagnant, an incremental merge (based on the modified field) is attractive in theory. However, I need to handle the possibility of deletes from the source table. In our case the source has auditing enabled, so I assume I could use the audit tables to figure that out (although queries against them are painfully slow). Is there an example of an existing ETL that selects from datasets and deals with deletes?

Thanks,
Ben

 
 
Jon (LabKey DevOps) responded:  2018-04-11 15:50
I'll get this confirmed, but I don't think we have anything beyond the standard "append", "merge", and "truncate" options on the target table, and only the filtering strategies "SelectAllFilterStrategy", "RunFilterStrategy", and "ModifiedSinceFilterStrategy" for the source.

https://www.labkey.org/Documentation/wiki-page.view?name=etlSchedule#target
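For reference, a minimal ETL definition combining one of those filter strategies with a merge target might look like the sketch below. The module, schema, query, and remote-connection names are placeholders; check the linked documentation for the exact attribute names your server version supports.

```xml
<etl xmlns="http://labkey.org/etl/xml">
    <name>DatasetSync</name>
    <description>Incrementally merge a remote dataset into a local one</description>
    <transforms>
        <transform id="mergeDataset">
            <!-- "RemoteStudyServer" is a placeholder for a named remote connection -->
            <source schemaName="study" queryName="Demographics" remoteSource="RemoteStudyServer"/>
            <destination schemaName="study" queryName="Demographics" targetOption="merge"/>
        </transform>
    </transforms>
    <!-- Only pull rows changed since the last successful run -->
    <incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="modified"/>
    <schedule><poll interval="1h"/></schedule>
</etl>
```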

Regards,

Jon
 
Ben Bimber responded:  2018-04-11 15:59
So incrementalFilter has deletedRowsSource, which I assume is the select that identifies the list of PKs to remove; however, that list needs to come from somewhere. Auditing is really the only way I can think of to come up with this list of IDs. Because we're pulling from a fairly standard LK data source (datasets), I thought I'd see if any groups have experience with how to approach this.
 
Jon (LabKey DevOps) responded:  2018-04-20 12:13
Hi Ben,

You're right, https://www.labkey.org/Documentation/wiki-page.view?name=etlSchedule#deletedRowsSource would work in this case, but it is typically used with external sources, where a database trigger populates the deletedRowsSource (such as an audit table) on deletes.
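Wired into the ETL definition, a deletedRowsSource sketch might look like this (the list name and key column names are placeholders; the attribute names follow the documentation linked above):

```xml
<incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="modified">
    <!-- "deletedRecords" is a hypothetical list populated whenever a source row is deleted -->
    <deletedRowsSource schemaName="lists" queryName="deletedRecords"
                       deletedSourceKeyColumnName="recordId"
                       targetKeyColumnName="lsid"/>
</incrementalFilter>
```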

You could query the LabKey audit table as you're thinking, but as you've seen it would be slow; indexing could possibly speed things up. Another alternative would be to create an afterDelete JS trigger script on the dataset that populates a second (custom) table to serve as the deletedRowsSource.
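That trigger-script approach could be sketched roughly as below. This is a server-side trigger script (placed in a module at queries/study/&lt;dataset&gt;.js) and only runs inside LabKey's server-side JavaScript environment; the tracking list "deletedRecords" and its columns are hypothetical.

```javascript
// Hypothetical afterDelete trigger: records each deleted row's key into a
// custom list that the ETL's deletedRowsSource query can then select from.
var LABKEY = require("labkey");  // LabKey's server-side script API

function afterDelete(row, errors) {
    LABKEY.Query.insertRows({
        schemaName: "lists",
        queryName: "deletedRecords",   // hypothetical tracking list
        rows: [{
            recordId: row.lsid,        // key of the deleted dataset row
            deletedDate: new Date()
        }]
    });
}
```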

Regards,

Jon