ETL dataset with a large IN filter? | Ben Bimber | 2018-04-03 16:02
Status: Closed
I'd like to ETL a large dataset from one server to a second LabKey server. The source table has data on a large number of subjects; however, I only want data from about 1000-2000 of them, and that list changes regularly. That whitelist of subjects only exists on the destination server (in a table). Is there a good way to implement this type of filter in the ETL? I am assuming an IN clause (which is basically what the ETL presents) would be a problem with this many IDs, but I haven't formally tested that.

I am considering requesting the entire dataset in the ETL and then adding some kind of Java transform code that would run on the destination server to skip over rows I don't want (probably not the most efficient approach, but maybe better than nothing). The best case would probably be to make that whitelist of IDs available to the source server in a database table, so I could write a SQL join query and then request that query, unfiltered, from my destination server. It would in theory be easy enough to make an ETL in the opposite direction to sync the whitelist of IDs from the destination server to the source server; however, for a variety of reasons I can't practically deploy code to the source server without jumping through a lot of hoops. My understanding is that I'd need code in order to add this ETL; is that right?

Is there another way to think about this problem that I'm missing? Thanks
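For reference, one common workaround for an oversized IN clause is to batch the whitelist into smaller chunks and issue one filtered request per chunk, merging the results client-side. A minimal sketch of that idea in plain Python, under stated assumptions: `fetch_filtered` is a hypothetical helper, not part of the LabKey API, and the in-memory filter stands in for whatever remote call would apply the per-chunk `IN` filter.

```python
# Sketch: batching a large subject whitelist into smaller IN-style filters,
# instead of sending all 1000-2000 IDs in a single IN clause.

def chunked(ids, size):
    """Yield successive slices of the ID list, each at most `size` long."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def fetch_filtered(all_rows, whitelist, chunk_size=500):
    """Collect rows whose subject_id is in the whitelist, one chunk at a time.

    In a real ETL, each chunk would become one remote request, e.g.
    SELECT ... WHERE subject_id IN (<chunk>); here the filter is applied
    locally purely for illustration.
    """
    results = []
    for chunk in chunked(sorted(whitelist), chunk_size):
        allowed = set(chunk)
        results.extend(r for r in all_rows if r["subject_id"] in allowed)
    return results
```

Chunk size is a tunable: smaller chunks keep each query's IN list well under any parameter or statement-length limits, at the cost of more round trips.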