Using custom perl/python/R scripts in LabKey only when importing files?

Installation Forum (Inactive)
martin lague  2015-11-26 12:44
Status: Closed
 
Is it possible to run our own pipelines (perl/python/R scripts) somewhere other than when importing files? Is it possible to have them alongside the other analysis scripts/tools in the Sequence Analysis (DISCVR-Seq) module?

I've spent some time reading about creating pipelines, but I can only find examples and descriptions of pipelines that run when importing files, not ones that work on files that have already been uploaded/analysed.
 
 
Jon (LabKey DevOps) responded:  2015-11-27 16:23
Hi Martin,

I'm not quite sure I understand what you're asking. Can you give us some more details about what you're looking to do here?

Within the UI, a pipeline can be added to any Project or Folder by going to Admin > Go To Module > Pipeline and then clicking the Setup button to set up a Pipeline Override. For anything within LabKey to be used by a module, script, or tool, the data can only come in from a few places: as an uploaded file via the FileContent module or through a pipeline, via an ETL that pulls the data from an external database schema, or through some special API-based process integrated within a module.

What are you hoping to do with the Sequence Analysis module in conjunction with the pipeline, without importing files?

Regards,

Jon
 
Ben Bimber responded:  2015-11-28 08:45
Hi Martin,

I can say a little here, though like Jon says, a little more detail on what you're trying to do would help.

You probably read that LabKey pipelines can be defined and associated with particular file types as input (e.g. BAM, TIFF, FASTQ). The main place LabKey exposes this to a user is through the file or pipeline webparts. If you select file(s) of that type and hit 'Import', you get a list of available pipelines you can kick off.

I'm guessing here about your goals, but I assume you want to kick off code with either different inputs (like DB records), or from a different place in the UI. For example, perhaps you already imported these files and have a table holding metadata and a pointer to each file. The latter is a lot of what we do in DISCVR-Seq. There are a few basic options here:

1) LabKey R reports (and you can write python/perl reports too, I think) are one way to write and execute arbitrary code. The primary use of R reports (and the primary focus of the docs) involves starting w/ a grid, selecting rows, and switching to that report. Your code executes and is provided w/ a TSV of those rows. The use is heavily geared toward things like making a graph or a visualization; however, there is no reason the report couldn't do sequence processing, etc. Also note, this model usually runs synchronously (i.e. not as a background job). You can execute reports using LABKEY.Report.execute(), which would let you create UI that kicks off some piece of code (see the first sketch after this list). In principle this mechanism should allow you to do a whole lot; however, we never ended up using this in our applications - getting the UI associated with it never seemed to work for us, and other options were easier. This might be our failing and not LabKey's.

2) You can also write standard file-based LabKey pipelines (like the import ones you describe) and write UI to kick them off from other places in the UI. DISCVR-Seq does a little of this. For example, say you import FASTQ files, which really means adding records to a table with pointers to those files. From this grid, say you want to run some QC tool through a pipeline. You could write JavaScript that takes the selected rows, finds the filepaths of the associated files, and then uses LABKEY.Pipeline.startAnalysis() to initiate your file-based pipeline on those files (see the second sketch after this list).

3) If you're specifically doing sequence analysis and processing, DISCVR-Seq is another layer over LabKey's core pipeline. I wrote it with the goal of providing a layer to manage the data and handle much of the UI. It is written to allow other modules to register additional analysis steps or plugins. For example, if you wrote a script to calculate methylation rates with BAM files as input, you could register that and it would appear in the UI. These currently do require a small amount of Java code; however, I expect at some point we'll support XML or other mechanisms, similar to most resources in file-based modules. The VariantDB module is an example of another module registering resources w/ SequenceAnalysis (the primary DISCVR-Seq module). There are a couple of examples I could point to where most of the logic for an analysis lives in an external script, and Java is basically just the glue to register it. If you're interested, I can write more.
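
Here's a minimal sketch of option 1, assuming you have a saved R report you want to trigger from your own UI. This is untested, and the reportId and parameter names are hypothetical placeholders:

    LABKEY.Report.execute({
        reportId: 'db:302',                  // id of the saved report (hypothetical)
        inputParams: { runQC: 'true' },      // optional parameters passed to the script
        success: function(response) { console.log('report finished', response); },
        failure: function(error) { console.error('report failed', error); }
    });

And a sketch of option 2: look up the file paths for the selected rows, then start a file-based pipeline on them. Again untested; the schema, query, column names, and taskId are all hypothetical - substitute whatever your module actually defines:

    var selectedRowIds = [101, 102];  // e.g. gathered from the grid's checkbox selection

    LABKEY.Query.selectRows({
        schemaName: 'sequenceanalysis',   // hypothetical schema holding the file records
        queryName: 'readsets',            // hypothetical table with pointers to the files
        columns: 'rowid,fileName',
        filterArray: [
            LABKEY.Filter.create('rowid', selectedRowIds.join(';'), LABKEY.Filter.Types.IN)
        ],
        success: function(data) {
            var files = data.rows.map(function(r) { return r.fileName; });
            LABKEY.Pipeline.startAnalysis({
                taskId: 'myModule:pipeline:runQC',  // hypothetical pipeline task id
                path: './',                         // directory relative to the pipeline root
                files: files,                       // input file names for the job
                protocolName: 'qc-' + Date.now(),   // each run needs a protocol name
                saveProtocol: false,
                success: function() { console.log('pipeline job queued'); },
                failure: function(error) { console.error(error); }
            });
        }
    });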

-Ben
 
kevink responded:  2015-11-28 09:06
Just a few small notes: the LabKey file-based pipeline scripts can be written in R, perl, or python, or can execute an external tool. Most of the examples import results into a LabKey assay, but they don't have to -- the pipeline script can interact with the server using one of the client libraries to insert records into tables (see the sketch below), or it can just generate output files into the file system.
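
For example, something like this via the JavaScript client library (untested; the list name and fields are made up) - an R or python pipeline script would make the equivalent call through the Rlabkey or python client libraries:

    LABKEY.Query.insertRows({
        schemaName: 'lists',
        queryName: 'AnalysisResults',   // hypothetical list holding script output
        rows: [
            { sampleId: 'S-001', metric: 'meanCoverage', value: 42.7 }
        ],
        success: function(result) { console.log('inserted ' + result.rowsAffected + ' row(s)'); },
        failure: function(error) { console.error('insert failed', error); }
    });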

As Ben says, to kick off the pipeline script, you can start by navigating to a file in the file-browser and clicking the "Import Data" button. Alternatively, you can execute the pipeline via the LABKEY.Pipeline.startAnalysis() JavaScript API.

See here for more information:

https://www.labkey.org/wiki/home/Documentation/page.view?name=rPipeline
 
martin lague responded:  2015-12-01 10:08
Sorry for not being clear enough, but you were able to figure out what I needed.
My goal is to upload a set of FASTQ files once and run different custom-made pipelines on them, without having to upload those files every time. There might already be a way to do that, but I didn't find it. I would also like to run custom pipelines on the output files of other pipelines/tools (e.g. a BAM file from an alignment, or cleaned files with short reads and low-quality reads removed).

@Ben: options 2 and 3 look like what I need. I'll start testing option 2, but I'll need more information for option 3. I'll have a look at the VariantDB module.
 
Ben Bimber responded:  2015-12-01 15:28
Hi Martin,

As you might have read, one of the main features of DISCVR-Seq is handling that initial import of FASTQ data and connecting it w/ sample information. Once imported, these are searchable and you can kick off additional analyses or pipelines. However you end up executing pipelines, you might find the data-management parts of DISCVR-Seq and its schema for representing sequence data useful.

From the DISCVR-Seq perspective, there are two main ways you'd plug in new tools or analyses:

1) At least in our experience, an awful lot of sequence processing falls into the pattern: FASTQ -> pre-processing -> alignment -> BAM post-processing. Tool wrappers can be registered for each of these action types (e.g. a wrapper for a new aligner). The UI lets a user chain any combination of these together and save those workflows. You might check out MarkDuplicatesStep (Picard MarkDuplicates) or BWAAlignmentStep as examples.

2) You can register handlers (i.e., a process that does something) that act on specific file types. GBSAnalysisHandler from the VariantDB module is a relatively simple one. It is registered for BAM files and runs a shell script to calculate a bunch of metrics related to coverage. This example uses Java code to touch up the results and make an HTML report; however, in theory a handler could just run some external script that does all the work, with next to no code in Java.
 
martin lague responded:  2015-12-02 07:19
Hi Ben,
Those two options look like what we need.

Also, is there a way to contact you directly? If things go as planned, we might be able to contribute to LabKey/DISCVR starting in January. It's not certain yet, but there's a good chance it will happen.