Scripting pipelines with more than one input file

LabKey Support Forum (Inactive)
eva pujadas  2016-10-04 08:51
Status: Closed
 
Dear LabKey supporters,

Is it possible to specify more than one input file when defining a task?
E.g.:
<task xmlns="http://labkey.org/pipeline/xml"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:type="ScriptTaskType"
    name="HelloWorld"
    version="1.0">
  <description>Simple Hello World R task</description>
  <inputs>
    <file name="input.csv" required="true" />
    <file name="input2.csv" required="true" />
  </inputs>
  <script file="HelloWorld.R"/>
</task>

How would both input files then be read from the called script "HelloWorld.R"?
I've tried it, but using ${input.csv} and ${input2.csv} does not work: it reports an error saying that the file does not exist.

Thank you very much,
Eva
 
 
kevink responded:  2016-10-04 10:23
Unfortunately, for historical reasons, the pipeline task's input files are identified by file extension rather than by the actual input file name. If input2.csv is changed to input2.tsv, I believe your example will work, but the task will then expect a single csv file and a single tsv file as inputs.
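
For illustration, the <inputs> block of the task from the original post would then look roughly like the fragment below. This is only a sketch of the change described above and has not been tested; the rest of the task XML stays as posted. Presumably the script would then refer to the second file as ${input2.tsv}, but that is an assumption by analogy with the tokens used in the question.

```xml
<inputs>
  <!-- inputs are matched by extension: one .csv file and one .tsv file -->
  <file name="input.csv" required="true" />
  <file name="input2.tsv" required="true" />
</inputs>
```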

By default, when there are multiple files for a single input name, we launch the task multiple times, once for each input file. In your case, however, you'd like to have multiple input files sent to a single task. To do this, add splitFiles="true" to the "input.csv" input in the task xml.

Ideally we would expand the ${input.csv} token in the R script into a list of all the selected input files, but we haven't implemented that yet. As a workaround, you will need to read the taskInfo.tsv file to get the list of input files for the input.csv token. Here is some example code you can use:

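# read taskInfo.tsv as a two-column (name, value) table;
# the ${pipeline, taskInfo} token is replaced with the path to that file
jobInfo <- read.table("${pipeline, taskInfo}",
                      col.names=c("name", "value"),
                      header=FALSE, check.names=FALSE,
                      stringsAsFactors=FALSE, sep="\t", quote="",
                      fill=TRUE, na.strings="")

# collect all input files: keep the values whose name matches "input.csv"
inputFiles <- jobInfo$value[ grep("input\\.csv", jobInfo$name) ]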
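
Once inputFiles holds the paths, each file still has to be read in the script. The sketch below shows one way that might look. It is not taken from the HIPC script: the two temporary CSV files and their columns (id, x) are hypothetical stand-ins so the example runs outside a pipeline job, and it assumes every input file has the same columns.

```r
# Hypothetical stand-in for the vector built from taskInfo.tsv above;
# in a real task, inputFiles already exists and these lines are not needed.
inputFiles <- file.path(tempdir(), c("a.csv", "b.csv"))
write.csv(data.frame(id = 1:2, x = c(10, 20)), inputFiles[1], row.names = FALSE)
write.csv(data.frame(id = 3:4, x = c(30, 40)), inputFiles[2], row.names = FALSE)

# Fail early with a clear message if any listed file is missing
missingFiles <- inputFiles[!file.exists(inputFiles)]
if (length(missingFiles) > 0) {
  stop("Missing input file(s): ", paste(missingFiles, collapse = ", "))
}

# Read each file into its own data frame, keyed by file name
tables <- lapply(inputFiles, read.csv, stringsAsFactors = FALSE)
names(tables) <- basename(inputFiles)

# Stack the data frames into one (assumes identical columns in every file)
combined <- do.call(rbind, tables)
```

If the files have different layouts, skip the last step and work with the entries of tables individually.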


The HIPC group at Fred Hutch has a pipeline script on GitHub that uses multiple input files:

https://github.com/RGLab/LabKeyModules/blob/master/HIPCMatrix/pipeline/tasks/create-matrix.r
https://github.com/RGLab/LabKeyModules/blob/master/HIPCMatrix/pipeline/tasks/create-matrix.task.xml
 
eva pujadas responded:  2016-10-05 01:40
Hi Kevin,

Thanks a lot for this helpful and detailed explanation.

Best,
Eva