Transform Scripts: /Documentation/Archive/21.11

Transform Scripts

Transform scripts are attached to assay designs and run before the assay data is imported, so they can reshape the data file to match the expected import format. Several scripts can run in sequence to perform different transformations. The file extension of each script identifies the scripting engine used to run it. For example, a script named test.pl is run with the Perl scripting engine.

Overview
Scripting Prerequisites
How Transformation Scripts Work
Use Transform Scripts

Associate the Script with an Assay
Passing Run Properties to Transform Scripts

Transform scripts (which are always attached to assay designs) are different from trigger scripts, which are attached to a dataset (database table or query).

Overview

A wide range of scenarios can be addressed using transform scripts. For example:

Instrument-generated files often contain header lines before the main data table, denoted by a leading #, !, or other symbol. These lines may contain useful metadata about the protocol, reagents, or samples tested which should either be incorporated into the data import or skipped over to find the main data table.
File or data formats might be optimized for display, not for efficient storage and retrieval. Transformation scripts can clean, validate, and reformat imported data.
During import, display values from a lookup column may need to be mapped to foreign key values for storage.
You may need to fill in additional quality control values with imported assay data, or calculate contents of a new column from columns in the imported data.
You may need to inspect and change the data, populate empty columns, or modify run- and batch-level properties. If validation is needed only for particular single-field values, the simpler mechanism is a validator defined in the field properties for the column.
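The first scenario above (header lines before the main data table) can be sketched in Python as follows. This is a minimal illustration, not part of LabKey Server: the function name split_header_and_table, the "key: value" header layout, and the sample lines are all assumptions made for the example.

```python
def split_header_and_table(lines, header_markers=("#", "!")):
    """Separate leading header lines from the main data table.

    Returns (metadata, table_lines). Header lines shaped like
    '# key: value' are collected into the metadata dict; other
    header lines are skipped. The 'key: value' layout is an assumption.
    """
    metadata = {}
    table_lines = []
    in_header = True
    for line in lines:
        text = line.rstrip("\r\n")
        if in_header:
            if not text.strip():
                continue  # ignore blank lines inside the header block
            if text.startswith(header_markers):
                body = text.lstrip("".join(header_markers)).strip()
                if ":" in body:
                    key, value = body.split(":", 1)
                    metadata[key.strip()] = value.strip()
                continue
            in_header = False  # first non-header line starts the table
        if text.strip():
            table_lines.append(text)
    return metadata, table_lines


# Hypothetical instrument output used only to exercise the function.
sample = [
    "# Protocol: ELISA-v2",
    "! generated by instrument",
    "Well\tSignal",
    "A1\t0.52",
]
metadata, table = split_header_and_table(sample)
```

Once the table starts, later lines are kept even if they begin with a marker character, so data values that happen to start with # are not dropped.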

Scripting Prerequisites

Any scripting language that can be invoked via the command line and has the ability to read/write files is supported for transformation scripts, including:

Perl
Python
R
Java

Before you can run scripts, you must configure the necessary scripting engine on your server.

How Transformation Scripts Work

Script Execution Sequence - Transformation and validation scripts are invoked in the following sequence:

1. A user imports assay result data.
2. The server creates a runProperties.tsv file and rewrites the assay result data into tab-separated format (TSV). Assay-specific properties and files from both the run- and batch-level fields are incorporated. See Run Properties Reference for full lists of properties.
3. The server invokes the transform script by passing it the information created in step 2 (the runProperties.tsv file).
4. After script completion, the server checks whether any errors have been written by the transform script and whether any data has been transformed.
5. If transformed data is available, the server uses it for subsequent steps; otherwise, the original data is used.
6. If multiple transform scripts are specified, the server invokes the other scripts in the order in which they are defined.
7. Field-level validator/quality-control checks (including range and regular expression validation) are performed. (These field-level checks are defined in the assay definition.)
8. If no errors have occurred, the run is loaded into the database.
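The sequencing rules above can be summarized with a toy Python model. It is only a sketch of the described behavior, not the server's implementation: simulate_import, the (transformed, errors) return convention, and the choice to stop at the first script that reports errors are all assumptions made for illustration.

```python
def simulate_import(rows, scripts, validators):
    """Toy model of the import sequence described above.

    scripts:    callables taking the current data and returning
                (transformed_data_or_None, list_of_error_messages)
    validators: callables taking the data and returning a list of
                error messages (field-level checks)
    """
    data = rows
    for script in scripts:  # scripts run in the order they are defined
        transformed, errors = script(data)
        if errors:
            # Assumption: an error stops the import before later steps.
            return {"loaded": False, "errors": errors, "data": data}
        if transformed is not None:
            data = transformed  # otherwise keep using the previous data
    # Field-level validation runs after all transform scripts.
    errors = [msg for check in validators for msg in check(data)]
    return {"loaded": not errors, "errors": errors, "data": data}


# Hypothetical scripts and validator for demonstration.
def uppercase(rows):
    return [r.upper() for r in rows], []

def unchanged(rows):
    return None, []

def no_blanks(rows):
    return ["blank value"] if "" in rows else []

result = simulate_import(["a", "b"], [unchanged, uppercase], [no_blanks])
```

The model makes one point explicit: a script that produces no transformed data leaves the input untouched for the next script and for validation.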

Use Transform Scripts

Each assay design can be associated with one or more validation or transform scripts which are run in the order specified.

This section describes the process of using a transform script that has already been developed for your assay type. For an example workflow showing how to create an assay transform script in Perl, see Example Workflow: Develop a Transformation Script (perl).

Identifying the Path to the Script File

It is convenient to upload the script file to the File Repository in the same folder as the assay design. The absolute path to the script file can be determined by concatenating the file root for the folder (available at (Admin) > Folder > Management > Files tab) plus the path to the script file in the Files web part (for example, "@files\scripts\LoadData.R"). In the file path, LabKey Server accepts either backslashes (the default Windows format) or forward slashes.

Example path to script:

/labkey/labkey/files/MyProject/MyAssayFolder/@files/MyTransformScript.R
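The concatenation described above can be sketched in Python. The helper name script_path is hypothetical, and normalizing to forward slashes is a choice made for this example, relying only on the statement that LabKey Server accepts either separator.

```python
import posixpath


def script_path(file_root, path_in_files_webpart):
    """Join the folder's file root and the script's path in the Files web part.

    Both inputs may use backslashes or forward slashes; the result
    uses forward slashes throughout.
    """
    root = file_root.replace("\\", "/").rstrip("/")
    relative = path_in_files_webpart.replace("\\", "/").lstrip("/")
    return posixpath.join(root, relative)


full_path = script_path(
    "/labkey/labkey/files/MyProject/MyAssayFolder",
    "@files\\MyTransformScript.R",
)
```

With these inputs the result is the example path shown above.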

When working on your own developer workstation, you can put the script file wherever you like, but putting it within the File Repository will make it easier to deploy to a production server. It also makes iterative development against a remote server easier, since you can use a WebDAV-enabled file editor to directly edit the same script file that the server is calling.

When deciding where to put your transform script file, keep in mind that it is convenient to store script files in the same location as the assay. You will enter the full path when you configure your assay design to run the script. To refer to that location, use the built-in substitution token "${srcDirectory}", which the server automatically replaces with the directory containing the called script file (the one identified in the Transform Scripts field).

Accessing and Using the Run Properties File

The primary mechanism for communication between the LabKey assay framework and the transform script is the run properties file. As with the script directory, a substitution token, ${runInfo}, tells the script code where to find this file. An R script, for example, should contain a line like:

run.props = labkey.transform.readRunPropertiesFile("${runInfo}");

The run properties file contains three categories of properties:

1. Batch and run properties as defined by the user when creating an assay instance. These properties are of the format: <property name> <property value> <java data type>

for example,

gDarkStdDev 1.98223 java.lang.Double

An example Run Properties file to examine: runProperties.tsv

When the transform script is called these properties will contain any values that the user has typed into the “Batch Properties” and “Run Properties” sections of the import form. The transform script can assign or modify these properties based on calculations or by reading them from the raw data file from the instrument. The script must then write the modified properties file to the location specified by the transformedRunPropertiesFile property.

2. Context properties of the assay such as assayName, runComments, and containerPath. These are recorded in the same format as the user-defined batch and run properties, but they cannot be overwritten by the script.

3. Paths to input and output files. These are absolute paths that the script reads from or writes to. They are in a <property name> <property value> format without property types. The paths currently used are:

a. runDataUploadedFile: The assay result file selected and imported to the server by the user. This can be an Excel file (XLS, XLSX), a tab-separated text file (TSV), or a comma-separated text file (CSV).
b. runDataFile: The file produced after the assay framework converts the user imported file to TSV format. The path will point to a subfolder below the script file directory, with a path value similar to the example below. The AssayId_22\42 part of the directory path serves to separate the temporary files from multiple executions by multiple scripts in the same folder.

C:\labkey\files\transforms\@files\scripts\TransformAndValidationFiles\AssayId_22\42\runDataFile.tsv

c. AssayRunTSVData: This file path is where the result of the transform script will be written. It will point to a unique file name in an “assaydata” directory that the framework creates at the root of the files tree. NOTE: this property is written on the same line as the runDataFile property.
d. errorsFile: This path is where a transform or validation script can write out error messages for use in troubleshooting. It is not normally needed by an R script, because the script usually writes errors to stdout, which the framework writes to a file named “<scriptname>.Rout”.
e. transformedRunPropertiesFile: This path is where the script writes out the updated values of batch- and run-level properties that are listed in the runProperties file.
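As a rough Python sketch of working with these properties, the functions below parse the run properties file and write updated values to the transformedRunPropertiesFile location. The function names are hypothetical, and the two-column name/value output layout is an assumption; check the Run Properties Reference for the exact format the server expects.

```python
def parse_run_properties(lines):
    """Map each property name to the list of values that follow it on its line."""
    props = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")  # the file is tab-separated
        if fields and fields[0]:
            props[fields[0]] = fields[1:]
    return props


def format_properties(updates):
    """Render name/value pairs as tab-separated lines (assumed output layout)."""
    return "".join(f"{name}\t{value}\n" for name, value in updates.items())


def update_properties(run_info_path, updates):
    """Read the run properties file, then write updated values where the server looks for them."""
    with open(run_info_path, "r") as handle:
        props = parse_run_properties(handle)
    out_path = props["transformedRunPropertiesFile"][0]
    with open(out_path, "w") as handle:
        handle.write(format_properties(updates))
    return out_path
```

A script would call update_properties with the run properties path and a dict such as {"gDarkStdDev": 2.5}. Keeping every value on a line (rather than only the first) preserves the data type column and any additional paths written on the same line.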

Choosing the Input File for Transform Script Processing

The transform script developer can use either the runDataFile or the runDataUploadedFile as input. The runDataFile is the right choice for an Excel-format raw file: the assay framework does the Excel-to-TSV conversion, so the script doesn't need to know how to parse Excel files. The runDataUploadedFile is the right choice if, for example, the original file is already in TSV format or the conversion process does not produce a usable TSV file.

A Python example that loads the original imported results file...

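# filePathRunProperties must hold the path to the run properties file.
filePathIn = None
with open(filePathRunProperties, "r") as fileRunProperties:
    for line in fileRunProperties:
        # Split on tabs only, so paths that contain spaces stay intact.
        row = line.rstrip("\r\n").split("\t")
        if row[0] == "runDataUploadedFile":
            filePathIn = row[1]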

… and one that loads the TSV file produced by the assay framework's conversion:

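# filePathRunProperties must hold the path to the run properties file.
filePathIn = None
with open(filePathRunProperties, "r") as fileRunProperties:
    for line in fileRunProperties:
        # Split on tabs only, so paths that contain spaces stay intact.
        row = line.rstrip("\r\n").split("\t")
        if row[0] == "runDataFile":
            filePathIn = row[1]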
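The choice between the two inputs can also be made in code. The sketch below follows the guidance above (use the converted TSV for Excel uploads, otherwise the uploaded file); the helper name choose_input_path and the decision to key on the file extension are assumptions for illustration.

```python
import os

# Extensions treated as Excel uploads in this sketch.
EXCEL_EXTENSIONS = {".xls", ".xlsx"}


def choose_input_path(paths):
    """Pick the file a script should read.

    paths maps property names ("runDataUploadedFile", "runDataFile")
    to file paths taken from the run properties file.
    """
    uploaded = paths["runDataUploadedFile"]
    converted = paths.get("runDataFile")
    extension = os.path.splitext(uploaded)[1].lower()
    if extension in EXCEL_EXTENSIONS and converted:
        return converted  # let the framework's Excel-to-TSV conversion do the work
    return uploaded  # already text, or no converted file is available
```

For example, passing an uploaded file named results.xlsx together with a runDataFile path returns the runDataFile path, while an uploaded results.tsv is returned unchanged.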

Associate the Script with an Assay

To specify a transform script in an assay design, enter the full path, including the file extension, in the Transform Scripts field.

Open the assay designer for a new assay, or edit an existing assay design.
Click Add Script.
Enter the full path to the script in the Transform Scripts field.
You may enter multiple scripts by clicking Add Script again. Delete scripts by clicking the X to the right of the row.

Confirm that other properties and fields required by your assay are correctly specified.
Scroll down and click Save.

When you import (or re-import) run data using this assay design, the script will be executed.

There are two useful options presented as checkboxes in the Assay designer.

Save Script Data for Debugging tells the framework not to delete intermediate files, such as the runProperties file, after a successful run. This option is important during script development. It can be turned off to avoid cluttering the file space under the TransformAndValidationFiles directory that the framework automatically creates under the script file directory.
Import In Background tells the framework to create a pipeline job as part of the import process, rather than tying up the browser session. It is useful for importing large data sets.

A few notes on usage:

Client API calls are not supported in transform scripts.
Columns populated by transform scripts must already exist in the assay definition.
Executed scripts show up in the experimental graph, providing a record that transformations and/or quality control scripts were run.
Transform scripts are run before field-level validators.
The script is invoked once per run upload.
Multiple scripts are invoked in the order they are listed in the assay design.
Note that non-programmatic quality control remains available -- assay designs can be configured to perform basic checks for data types, required values, regular expressions, and ranges. Learn more in these topics: Field Editor and Dataset QC States - Admin Guide.

The general purpose assay tutorial includes another example use of a transform script in Set up a Data Transform Script.

Passing Run Properties to Transform Scripts

Information on run properties can be passed to a transform script in two ways. You can put a substitution token into your script to identify the run properties file, or you can configure your scripting engine to pass the file path as a command line argument. See Transformation Script Substitution Syntax for a list of available substitution tokens.

For example, using Perl:

Option #1: Put a substitution token (${runInfo}) into your script and the server will replace it with the path to the run properties file. Here's a snippet of a Perl script that uses this method:

# Open the run properties file. Run or upload set properties are not used by
# this script. We are only interested in the file paths for the run data and
# the error file.

open my $reportProps, '<', '${runInfo}' or die "Cannot open run properties file: $!";

Option #2: Configure your scripting engine definition so that the file path is passed as a command line argument:

Go to (Admin) > Site > Admin Console.
Under Configuration, click Views and Scripting.
Select and edit the Perl engine.
Add ${runInfo} to the Program Command field.
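The same two options can be combined in a Python script. The sketch below is an illustration only: the helper name resolve_run_properties_path is hypothetical, and it assumes that with Option #2 the path arrives as the first command line argument after the script name, which depends on how the Program Command is configured.

```python
import sys

# Option #1: the server replaces this token with the real path before the
# script runs. A raw string keeps Windows backslashes from being read as escapes.
RUN_INFO = r"${runInfo}"


def resolve_run_properties_path(argv, embedded=RUN_INFO):
    """Return the run properties path from the command line or the embedded token."""
    if len(argv) > 1:
        return argv[1]  # Option #2: path passed as a command line argument
    if embedded.startswith("${"):
        # The token was never substituted, so no path is available.
        raise RuntimeError("run properties path was not provided")
    return embedded


if __name__ == "__main__" and len(sys.argv) > 1:
    print(resolve_run_properties_path(sys.argv))
```

Checking the command line first lets one script work under either engine configuration.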

LabKey Support