This topic covers the types of tasks available in the <transform> element of an ETL. Each <transform> element sets its type attribute to one of the available tasks described below:
Transform Task
The basic transform task is org.labkey.di.pipeline.TransformTask. Syntax looks like:
...
<transform id="step1" type="org.labkey.di.pipeline.TransformTask">
    <description>Copy to target</description>
    <source schemaName="etltest" queryName="source" />
    <destination schemaName="etltest" queryName="target" />
</transform>
...
The <source> refers to a schemaName that appears local to the container, but may in fact be an external schema or linked schema that was previously configured. For example, to reference a source table in a schema named "myLinkedSchema", you would use:
...
<transform id="step1" type="org.labkey.di.pipeline.TransformTask">
    <description>Copy to target</description>
    <source schemaName="myLinkedSchema" queryName="source" />
    <destination schemaName="etltest" queryName="target" />
</transform>
...
Remote Query Transform Step
In addition to supporting the use of external schemas and linked schemas, ETL modules can access data through a remote connection to an alternate LabKey Server. To set up a remote connection, see ETL: Manage Remote Connections.
The transform type is RemoteQueryTransformStep, and your <source> element must include the remoteSource attribute in addition to the schemaName and queryName on that remote source, as shown below:
...
<transform type="RemoteQueryTransformStep" id="step1">
    <source remoteSource="EtlTest_RemoteConnection" schemaName="study" queryName="etl source" />
    … <!-- the destination and other options for the transform are included here -->
</transform>
...
Note that using <deletedRowSource> with an <incrementalFilter> strategy does not support a remote connection.
Queue Job Task
Calling an ETL from another ETL is accomplished by using the transform type TaskRefTransformStep and including a <taskref> that refers to org.labkey.di.steps.QueueJobTask.
Learn more and see syntax examples in this topic:
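A minimal sketch of such a step is shown below. The ETL to queue is identified by a transformId setting; the value "{MyModule}/targetEtl" is a hypothetical module name and ETL name used only for illustration — substitute your own:

```xml
<transform id="queueStep" type="TaskRefTransformStep">
    <taskref ref="org.labkey.di.steps.QueueJobTask">
        <settings>
            <!-- Placeholder: the module and name of the ETL job to queue -->
            <setting name="transformId" value="{MyModule}/targetEtl"/>
        </settings>
    </taskref>
</transform>
```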
Run Report Task
Note: Currently only R reports are supported by this feature.
This task can be used to kick off the running of an R report. The report runs in the background, outside the context of the ETL; it does not participate in transformation chains or write to destinations within the ETL. It also will not have access to the automatic "labkey.data" frame available to R reports running locally. Instead, it will check the directory where it is run for a file named input_data.tsv to use as the data frame. If such a file is not available, you can use the Rlabkey API to query for the data and then assign that data frame to labkey.data.
To kick off the running of an R report from an ETL, first create your report. You will need the reportID and the names/expected values of any parameters to that report. The reportID can be either:
- db:<###> - use the report number as seen in the URL when viewing the report in the UI.
- module:<report path>/<report name> - for module-defined reports, the number is not known up front. Instead you can provide the module and path to the report by name.
In addition, the report must set the property "runInBackground" to "true" to be runnable from an ETL. In a <ReportName>.report.xml file, the ReportDescriptor would look like:
<ReportDescriptor descriptorType="rReportDescriptor" reportName="etlReport" xmlns="http://labkey.org/query/xml">
    <Properties>
        <Prop name="runInBackground">true</Prop>
    </Properties>
    <tags/>
</ReportDescriptor>
Within your ETL, include a transform of type TaskRefTransformStep that calls the <taskref> org.labkey.di.pipeline.RunReportTask. Syntax looks like:
...
<transform id="step1" type="TaskRefTransformStep">
    <taskref ref="org.labkey.di.pipeline.RunReportTask">
        <settings>
            <setting name="reportId" value="db:307"/>
            <setting name="myparam" value="myvalue"/>
        </settings>
    </taskref>
</transform>
...
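For a module-defined report, the reportId setting uses the module form described above. A sketch, assuming a hypothetical module "myModule" containing a report named "etlReport" (the exact path segment depends on where the report file lives in your module):

```xml
<transform id="step1" type="TaskRefTransformStep">
    <taskref ref="org.labkey.di.pipeline.RunReportTask">
        <settings>
            <!-- "myModule" and "etlReport" are placeholders; use your module's report path -->
            <setting name="reportId" value="module:myModule/etlReport"/>
        </settings>
    </taskref>
</transform>
```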
Learn more about writing R reports in this topic:
Add New TaskRefTask
The queueing and report-running tasks above are implemented using the TaskRefTask. This is a very flexible mechanism, allowing you to provide any Java code to be run by the task on the pipeline thread. This task does not have input or output data from the ETL pipeline; it is simply a Java thread, with access to the Java APIs, that will run synchronously in the pipeline queue.
To add a new TaskRefTask, write your Java code in a module and reference it using syntax similar to the above for the RunReportTask. If the module includes "MyTask.java", syntax for calling it from an ETL would look like:
...
<transform id="step1" type="TaskRefTransformStep">
    <taskref ref="[Module path].MyTask">
        ...
    </taskref>
</transform>
...
Stored Procedures
When working with a stored procedure, you use a <transform> of type StoredProcedure. Syntax looks like:
...
<transform id="ExtendedPatients" type="StoredProcedure">
    <description>Calculates date of death or last contact for a patient, and patient ages at events of interest</description>
    <procedure schemaName="patient" procedureName="PopulateExtendedPatients" useTransaction="true">
    </procedure>
</transform>
...
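Stored procedures that take arguments can have values passed in via <parameter> child elements of <procedure>. A sketch, assuming the procedure accepts a parameter named "InParam" (both the parameter name and value here are hypothetical, for illustration only):

```xml
<transform id="ExtendedPatients" type="StoredProcedure">
    <procedure schemaName="patient" procedureName="PopulateExtendedPatients" useTransaction="true">
        <!-- "InParam" and its value are placeholders for your procedure's actual parameters -->
        <parameter name="InParam" value="before"/>
    </procedure>
</transform>
```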
External Pipeline Task - Command Tasks
Once a command task has been registered in a
pipeline task xml file, you can specify the task as an ETL step. In this example, "myEngineCommand.pipeline.xml" is already available. It could be incorporated into an ETL with syntax like this:
...
<transform id="ProcessingEngine" type="ExternalPipelineTask"
           externalTaskId="org.labkey.api.pipeline.cmd.CommandTask:myEngineCommand"/>
...
To see a listing of all the registered pipeline tasks on your server, including their respective taskIds:
- Select (Admin) > Site > Admin Console.
- Under Diagnostics, click Pipelines and Tasks.
Related Topics