External Data Source for a Study?

LabKey Support Forum (Inactive)
External Data Source for a Study? trent  2011-06-07 19:00
Status: Closed
 
What's the best way to go about this?

I know I can define an external data source and present it with a query web part, or an Ext grid referencing the query name - but I don't think it's possible to integrate it with a study's datasets? I.e., if I have an external data source where labs/sites are stored, I can define one for that and present it on the page via a Query Web Part, but it doesn't tie in with the study overview - where it outputs the number of datasets, sites, etc. (afaik).
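
For concreteness, the sort of thing I mean by presenting it with a query web part is roughly this (just a sketch using the JavaScript API; the schema/query names are placeholders for my external data):

    // Render a grid over a query exposed by an external schema on a wiki/HTML page.
    // 'external' and 'sites' are placeholder names.
    var qwp = new LABKEY.QueryWebPart({
        renderTo: 'sitesGrid',            // id of a <div> already on the page
        title: 'Sites (external source)',
        schemaName: 'external',
        queryName: 'sites'
    });
    qwp.render();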

The pipeline feature was suggested to me for loading in the data - is this the best method? From my quick investigation, you point the study 'Reload' feature at a location - so that means, on the external data side, having a scheduled job that exports the data into the format the reload feature expects, and configuring LabKey to refresh the data on a daily basis?

I'm not really sure if I've got the right ideas or am on the right track...

(Actually, it's mainly lab results to be managed within LabKey, with the rest of the data coming from external sources - if that matters at all.)

I have toyed with the idea of just creating a collaborative module and building it that way - but since the study module already has the base design down, that seems like it'd be a fair bit of (perhaps unnecessary) extra work.

Thanks.
 
 
adam responded:  2011-06-08 08:02
You have it right: external data sources can't be used as the source of study datasets... yet. The study reload feature is used to perform nightly reloads of study and specimen data from external systems. Typically, an organization writes a script or macro that exports from an external data source (SAS, a database server, a specimen management system, etc.) into a study archive (a zip file or directory containing XML and TSV files that define the data that needs to be refreshed... could be an entire study or just a subset) and then signals LabKey to reload it.
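
As a rough illustration only (the authoritative layout comes from exporting an existing study, as discussed below), an archive directory might look something like:

    study/
      study.xml                  - top-level study descriptor (see the study.xsd docs)
      datasets/
        datasets_manifest.xml    - lists the datasets to load and their properties
        dataset001.tsv           - one TSV per dataset being (re)loaded
        dataset002.tsv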

Our goal is to support reloading datasets directly from external data sources (the study archive would simply reference a query instead of a TSV file) and perhaps even to support dynamic links from a study into external data sources. This work is in the plan, though it's not currently at the top of the list for our clients.

Another important aspect of external data sources is the ability to access (and modify) their data via the JavaScript, Java, R, etc. APIs. This allows building custom workflows and applications that act on external data.
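
For example, reading rows from a query in an external schema with the JavaScript API looks roughly like this (schema and query names are placeholders):

    // Sketch: fetch rows from a query exposed by an external schema.
    LABKEY.Query.selectRows({
        schemaName: 'external',
        queryName: 'labs',
        success: function (data) {
            // data.rows is an array of row objects keyed by column name
            console.log('Fetched ' + data.rows.length + ' rows');
        },
        failure: function (errorInfo) {
            console.error(errorInfo.exception);
        }
    });

Inserts and updates work the same way via LABKEY.Query.insertRows / updateRows, provided the external schema is configured to allow edits.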

Adam
 
trent responded:  2011-06-08 16:17
Oh, cool. Thanks.

So when you say "could be an entire study or just a subset" - if I have datasets maintained within LabKey (as opposed to an external database), when I reload the study, that data won't get wiped? Or do I have to have that data in the study archive too (somehow)? I.e., if I have TSV files in an archive for data from the external sources, what happens to the datasets not defined in it (which would be the datasets maintained within LabKey)? I assume each dataset is just deleted and recreated.

I had a skim through the documentation but couldn't see any mention of this situation.
 
marki responded:  2011-06-08 17:06
Yes, you can choose to reload only some of your datasets while keeping others intact.

Another option is to use the "Query Snapshot" feature of study queries to create a study dataset from an external schema.

You would create a new query in your study schema that selects the desired columns from your source schema. E.g. given an external schema "mark" I could create a query in the study schema like so:

select ptid as participantid, sequencenum, date from mark.ptidtest

You could then use Views->Create->Query Snapshot to create a dataset from the results of the query. You can choose whether you want that dataset updated automatically (every n minutes) or manually.
 
trent responded:  2011-06-08 17:17
So, for defining a study pipeline, I guess it's best just to export a finished study to see how it all ties together and create a script to export data based on that, referring to the XML reference https://www.labkey.org/download/schema-docs/xml-schemas/index.html?schemas/study_xsd/schema-frame.html ...

Oh, the query snapshot looks like a nice feature. Though I may still have to use the pipeline feature, due to one of the external schemas being an Oracle DB.
 
adam responded:  2011-06-08 17:31
Yes, using export to create your sample study archive works great. You can also find a few sample archives on labkey.org and checked into the /sampledata directory in our subversion repository.

We've been wanting to add support for Oracle external data sources... wouldn't be too hard...

Adam
 
trent responded:  2011-06-13 22:32
Hi, I'm just getting around to actually giving this query snapshot a go, but I'm not sure if I am doing the right thing...

Create Data Set
-> Go to menu Views - Create - Query Snapshot
-> Accept Defaults and click 'Create Snapshot'
(At this point, the snapshot view is displayed, which is just a query of the created data set)
-> Go to menu Views - Edit Snapshot

This is the point where I can't do anything. There is a textarea, which is where I would expect to update the query source (is that correct?). The issue is that the textarea is set to readonly. I tried a few different things (editing the query directly through the HTML inspector, removing the readonly attribute, ...), but it doesn't update after clicking the update button.

(Before I forget, there is also another minor formatting issue. If you click 'Show History' on the Edit Query Snapshot page, the web part that shows the history displays raw HTML tags. There is an attribute that looks like this: title="<br/><b>Dataset Snapshot History:</b>" - I presume the HTML tags aren't meant to be in there..)

I also tried updating the study's security to "all can edit", basic security with editable datasets, and custom security with editable datasets and everyone set to edit all - in case that had something to do with not being able to update the query, but no change. Unless each time you load the snapshot view it is supposed to return the original query (even after changing the source)? Or should it reflect what you last updated it to?

..

I think I may have to do the pipeline over the above scenario anyway, as I don't think you can configure labs and visits to be sourced externally.

..

On the Oracle data source - I have enlisted in the source and created a couple of new classes in the BigIron module, basically a copy of MySqlDialect and MySqlDialectManager, and also registered them as per the other dialects. I presume I then just need to add a resource in labkey.xml and get the Oracle JDBC driver placed into tomcat/common/lib (as per the guide for the SAS external data source). I just have to test, and hopefully it'll be all good; no doubt I'll be back asking more questions on this if it doesn't work / I can't figure it out.
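
I'm assuming the labkey.xml entry is just a standard Tomcat JNDI DataSource along these lines (connection details below are placeholders):

    <!-- Placeholder Oracle data source entry for labkey.xml; name/URL/credentials are illustrative -->
    <Resource name="jdbc/oracleDataSource" auth="Container"
              type="javax.sql.DataSource"
              driverClassName="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:@dbhost:1521:ORCL"
              username="labkey" password="..."
              maxActive="20" maxIdle="10" maxWait="120000"/>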
 
marki responded:  2011-06-15 09:15
Unfortunately it's not possible to edit the query after the dataset is created; it needs to be re-created. We've had requests to change that, but it's never been at the top of anyone's list. I do still expect it to be fixed eventually.

Yes, I think you will have to do the pipeline import, though you could do a mix of the two if you wanted.

As for the Oracle data source, glad to hear you are taking a crack at it. Let us know if you need any help. Starting with MySqlDialect seems like a good way to go, though I think syntactically Oracle may be closer to the pg dialect.