ETL dataset with a large IN filter?

LabKey Support Forum
ETL dataset with a large IN filter? Ben Bimber  2018-04-03 16:02
Status: Closed
 

I'd like to ETL a large dataset from one LabKey server to a second. The source table has data on a large number of subjects; however, I only want data for about 1000-2000 of them, and that list changes regularly. That whitelist of subjects only exists on the destination server (in a table). Is there a good way to implement this type of filter in the ETL? I'm assuming an IN clause (which is basically what the ETL presents) would be a problem with this many IDs, but I haven't formally tested that. I'm considering requesting the entire dataset in the ETL and then adding some kind of Java transform code that would run on the destination server to skip over rows I don't want (probably not the most efficient, but maybe better than nothing).
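(For anyone hitting the same limit: one common workaround for an oversized IN filter is to batch the ID list and issue several smaller requests. A minimal sketch of building batched LabKey-style URL filter parameters — the column name `ParticipantId` and the batch size are placeholders; the `query.<column>~in=v1;v2` format is LabKey's standard URL filter syntax:)

```python
from urllib.parse import quote

def build_in_filters(column, ids, batch_size=500):
    """Split a large ID list into LabKey-style '~in' URL filter
    parameters, one per batch, to keep each request a manageable size."""
    filters = []
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        # LabKey URL filters look like 'query.<column>~in=v1;v2;v3'
        value = ";".join(str(x) for x in batch)
        filters.append("query.%s~in=%s" % (column, quote(value)))
    return filters

# Example: 1200 subject IDs become three batched filter parameters.
ids = ["S%04d" % n for n in range(1200)]
params = build_in_filters("ParticipantId", ids)
print(len(params))  # 3
```

Each resulting parameter would go on its own selectRows-style request, with the results combined client-side.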

The best case would probably be to make that whitelist of IDs available to the source server in a database table, so I could write a SQL join query and then request this query unfiltered from my destination server. It would in theory be easy enough to make an ETL in the opposite direction to sync the whitelist of IDs from the destination server to the source server; however, for a variety of reasons I can't practically deploy code to the source server without a lot of hoops. My understanding is that I'd need code in order to add this ETL; is that right?
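(If the whitelist does land in a table on the source server, the join query could look something like the following sketch — the schema, table, and column names here are made up for illustration:)

```sql
-- Return only rows for whitelisted subjects; the destination server's
-- ETL then requests this query with no additional filter.
SELECT d.*
FROM study.Demographics d
INNER JOIN lists.SubjectWhitelist w
    ON d.ParticipantId = w.ParticipantId
```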

Is there another way to think about this problem that I'm missing?

Thanks

 
 
Matthew Bellew responded:  2018-04-03 16:07

I think you're on the right track with a two-step ETL.

Do you have permission to write into a table in the source database?

  • If yes, the first ETL can copy the IDs to the source database, and then select (with a join) in the 'main' ETL.

  • If no, I might pursue a full copy into a 'staging' table then select with join into the final table.

 
Ben Bimber responded:  2018-04-03 16:15

That's an interesting idea. I do have write permission in the DBs, so I need to set up the source DB on my server as an external schema, right? This would solve the problem of pushing that whitelist to the source server.

Is there some way I don't know about to do an executeSQL-type select in an ETL, instead of RemoteQueryTransformStep (which seems to require a schema/query)? This is marginally preferable since the query definition would live on the destination server, in code, rather than my defining/saving SQL through the UI on the source server.

 
Matthew Bellew responded:  2018-04-03 16:30

The typical way is to add a LabKey query (.sql) that goes in your module along with your ETL (.xml) definition, as in this tutorial:

https://www.labkey.org/Documentation/wiki-page.view?name=etlTutorial3

 
Ben Bimber responded:  2018-04-03 16:51

Right, though this would ideally be present on the source server. Pushing IDs from destination to source directly in the DB will be robust enough to get over my major problem. I'll just define the select queries in the DB of the source server (not perfect, but good enough for now) and write my ETL to target those remote queries. Thanks.

 
Ben Bimber responded:  2018-04-03 16:56

Thinking about this: if the total number of records to push isn't huge, it is probably also possible to write an ETL task that simply uses the Java API to do a remote insert, isn't it? There doesn't appear to be an existing API to cleanly pull up the credentials for saved connections (ideally you'd share the user/password saved for a given data source), but that probably wouldn't be a big leap.
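(As a footnote for later readers: the remote insert doesn't strictly need the Java client, since LabKey exposes an insertRows HTTP endpoint that any client can call. A rough sketch of constructing such a request — the server URL, folder path, schema/query names, and the basic-auth credential handling are all placeholders, and nothing is actually sent here:)

```python
import base64
import json
from urllib.request import Request

def build_insert_request(base_url, folder, schema, query, rows, user, password):
    """Build a POST request for LabKey's query insertRows.api endpoint.
    Nothing is sent; callers would pass the result to urllib.request.urlopen()."""
    url = "%s/query/%s/insertRows.api" % (base_url.rstrip("/"), folder.strip("/"))
    body = json.dumps({"schemaName": schema, "queryName": query, "rows": rows})
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    return Request(
        url,
        data=body.encode(),
        headers={
            "Content-Type": "application/json",
            # Basic auth as a stand-in for however saved credentials get shared.
            "Authorization": "Basic " + token,
        },
        method="POST",
    )

req = build_insert_request(
    "https://example.org/labkey", "MyProject", "lists", "SubjectWhitelist",
    [{"ParticipantId": "S0001"}], "etl_user", "secret")
print(req.get_full_url())  # https://example.org/labkey/query/MyProject/insertRows.api
```

An ETL task could build one such request per batch of rows and execute it against the destination folder.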