Premium Feature — Available with all Premium Editions of LabKey Server.

This topic covers the types of tasks available in the <transform> element of an ETL. Each <transform> element sets its type attribute to one of the available tasks:

Transform Task

The basic transform task is org.labkey.di.pipeline.TransformTask. Syntax looks like:

...
<transform id="step1" type="org.labkey.di.pipeline.TransformTask">
    <description>Copy to target</description>
    <source schemaName="etltest" queryName="source" />
    <destination schemaName="etltest" queryName="target" />
</transform>
...

The <source> refers to a schemaName that appears to be local to the container, but it may in fact be an external schema or linked schema that was previously configured. For example, if you were referencing a source table in a schema named "myLinkedSchema", you would use:

...
<transform id="step1" type="org.labkey.di.pipeline.TransformTask">
    <description>Copy to target</description>
    <source schemaName="myLinkedSchema" queryName="source" />
    <destination schemaName="etltest" queryName="target" />
</transform>
...

Remote Query Transform Step

In addition to supporting use of external schemas and linked schemas, ETL modules can access data through a remote connection to an alternate LabKey Server.

To set up a remote connection, see ETL: Manage Remote Connections.

The transform type is RemoteQueryTransformStep, and your <source> element must include a remoteSource attribute in addition to the schemaName and queryName found on that remote source, as shown below:

...
<transform type="RemoteQueryTransformStep" id="step1">
    <source remoteSource="EtlTest_RemoteConnection" schemaName="study" queryName="etl source" />
    … <!-- the destination and other options for the transform are included here -->
</transform>
...

Note that using <deletedRowSource> with an <incrementalFilter> strategy does not support a remote connection.

Queue Job Task

Calling an ETL from another ETL is accomplished by using a <transform> of type TaskRefTransformStep and including a <taskref> that refers to org.labkey.di.steps.QueueJobTask.

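As a hedged sketch, a queueing step might look like the following; the setting name and the ETL identifier shown here (a module name in curly braces followed by the ETL name) are illustrative placeholders — confirm them against your server's version of the documentation:

...
<transform id="QueueTailJob" type="TaskRefTransformStep">
    <taskref ref="org.labkey.di.steps.QueueJobTask">
        <settings>
            <setting name="transformId" value="{myModule}/myOtherEtl"/>
        </settings>
    </taskref>
</transform>
...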
Run Report Task

Note: Currently only R reports are supported in this feature.

This task can be used to kick off the running of an R report. The report runs in the background, outside the context of the ETL: it does not participate in transformation chains or write to destinations within the ETL. It also will not have access to the automatic "labkey.data" frame available to R reports running locally; instead, it checks the directory where it is run for a file named input_data.tsv to use as the data frame. If no such file is available, you can use the Rlabkey API to query for the data and then assign that data frame to labkey.data.
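For example, the report's R source might fall back to Rlabkey when no input_data.tsv is present. This is a sketch; the baseUrl, folderPath, schemaName, and queryName values are placeholders to replace with your own:

if (file.exists("input_data.tsv")) {
    # Use the data frame the ETL provided in the working directory.
    labkey.data <- read.delim("input_data.tsv", stringsAsFactors = FALSE)
} else {
    # Otherwise query the server directly and assign the result to labkey.data.
    library(Rlabkey)
    labkey.data <- labkey.selectRows(
        baseUrl = "https://www.example.com/labkey",
        folderPath = "/MyProject",
        schemaName = "study",
        queryName = "etl source"
    )
}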

To kick off the running of an R report from an ETL, first create your report. You will need the reportId and the names and expected values of any parameters to that report. The reportId can be either:

  • db:<###> - use the report number as seen in the URL when viewing the report in the UI.
  • module:<report path>/<report name> - for module-defined reports, the number is not known up front. Instead, you can provide the module and the path to the report by name.
In addition, the report must set the property "runInBackground" to "true" to be runnable from an ETL. In a <ReportName>.report.xml file, the ReportDescriptor would look like:
<ReportDescriptor descriptorType="rReportDescriptor" reportName="etlReport" xmlns="http://labkey.org/query/xml">
    <Properties>
        <Prop name="runInBackground">true</Prop>
    </Properties>
    <tags/>
</ReportDescriptor>

Within your ETL, include a transform of type TaskRefTransformStep that calls the <taskref> org.labkey.di.pipeline.RunReportTask. Syntax looks like:

...
<transform id="step1" type="TaskRefTransformStep">
    <taskref ref="org.labkey.di.pipeline.RunReportTask">
        <settings>
            <setting name="reportId" value="db:307"/>
            <setting name="myparam" value="myvalue"/>
        </settings>
    </taskref>
</transform>
...

Add New TaskRefTask

The queueing and report-running tasks above are implemented using the TaskRefTask. This is a very flexible mechanism, allowing you to provide any Java code to be run by the task on the pipeline thread. This task does not have input or output data from the ETL pipeline; it is simply Java code, with access to the Java APIs, that will run synchronously in the pipeline queue.

To add a new TaskRefTask, write your Java code in a module and reference it using syntax similar to the above for the RunReportTask. If the module includes "MyTask.java", syntax for calling it from an ETL would look like:

...
<transform id="step1" type="TaskRefTransformStep">
    <taskref ref="[Module path].MyTask">
    ...
    </taskref>
</transform>
...

Stored Procedures

When working with a stored procedure, you use a <transform> of type StoredProcedure. Syntax looks like:

...
<transform id="ExtendedPatients" type="StoredProcedure">
    <description>Calculates date of death or last contact for a patient, and patient ages at events of interest</description>
    <procedure schemaName="patient" procedureName="PopulateExtendedPatients" useTransaction="true">
    </procedure>
</transform>
...
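Stored procedures can also accept input parameters supplied from the ETL definition. A hedged sketch, assuming a procedure that takes one input parameter; the parameter name and value here are illustrative:

...
<transform id="ExtendedPatients" type="StoredProcedure">
    <procedure schemaName="patient" procedureName="PopulateExtendedPatients" useTransaction="true">
        <parameter name="SourceTimepoint" value="baseline"/>
    </procedure>
</transform>
...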

External Pipeline Task - Command Tasks

Once a command task has been registered in a pipeline task xml file, you can specify the task as an ETL step. In this example, "myEngineCommand.pipeline.xml" is already available. It could be incorporated into an ETL with syntax like this:

...
<transform id="ProcessingEngine" type="ExternalPipelineTask"
           externalTaskId="org.labkey.api.pipeline.cmd.CommandTask:myEngineCommand"/>
...

To see a listing of all the registered pipeline tasks on your server, including their respective taskIds:

  • Select (Admin) > Site > Admin Console.
  • Under Diagnostics, click Pipelines and Tasks.
