Overview

Users importing instrument-generated tabular datasets into LabKey Server may run into the following difficulties:

  • Instrument-generated files often contain header lines before the main data set, denoted by a leading #, !, or other symbol. These lines usually contain useful metadata about the protocol, reagents, or samples tested, and in any case must be skipped over to find the main data set.
  • The file format is optimized for display, not for efficient storage and retrieval. For example, columns that correspond to individual samples are difficult to work with in a database.
  • The data to be imported contains the display values from a lookup column, which need to be mapped to the foreign key values for storage.
All of these problems can be solved with a transform script. Transform scripts were originally designed to fill in additional columns, such as quality control values, in an imported assay data set, but the assay framework allows them to solve a much wider range of challenges. R is a good choice of language for writing transform scripts because it has extensive built-in functionality for manipulating tabular data sets.

First, we review how to hook up a transform script to an assay and the communication mechanisms between the assay framework and an R transform script.

Identifying the Path to the Script File

Transform scripts are attached to an assay by providing the fully qualified path to the script file in the field named at the top of the assay instance definition. A convenient approach is to upload the script file using a Files web part defined in the same folder as the assay definition. The fully qualified path to the script file is then the concatenation of the file root for the folder (for example, "C:\lktrunk\build\deploy\files\MyAssayFolderName\@files\", as determined by the Files page in the Admin console) and the file path to the script file as seen in the Files web part (for example, "scripts\LoadData.R"). For the file path, LabKey Server accepts either backslashes (the default Windows format) or forward slashes.

When working on your own developer workstation, you can put the script file wherever you like, but putting it within the scope of the File manager makes it easier to deploy to a server. It also makes iterative development against a remote server easier, since you can use a WebDAV-enabled file editor to directly edit the same file that the server is calling.

If your transform script calls other script files to do its work, the normal way to pull in the source code is using the source statement, for example

source("C:\lktrunk\build\deploy\files\MyAssayFolderName\@files\Utils.R")

To keep the scripts easily portable to other servers, however, it is better to keep the script files together and use the built-in substitution token "${srcDirectory}", which the server automatically fills in with the directory where the calling script file is located, for example:

source("${srcDirectory}/Utils.R");

Accessing and Using the Run Properties File

The primary mechanism for communication between the LabKey Assay framework and the Transform script is the Run Properties file. Again a substitution token tells the script code where to find this file. The script file should contain a line like

rpPath<- "${runInfo}"

When the script is invoked by the assay framework, the rpPath variable will contain the fully qualified path to the run properties file.

The run properties file contains three categories of properties:

1. Batch and run properties as defined by the user when creating an assay instance. These are written one per line, tab-delimited, in the format: <property name> <property value> <java data type>

for example,

gDarkStdDev 1.98223 java.lang.Double

When the transform script is called, these properties contain any values that the user has typed into the corresponding text boxes under the “Batch Properties” or “Run Properties” sections of the upload form. The transform script can assign or modify these properties based on calculations or by reading them from the raw data file from the instrument. The script must then write the modified properties to the location specified by the transformedRunPropertiesFile property (see #3 below); a minimal sketch follows this list.

2. Context properties of the assay such as assayName, runComments, and containerPath. These are recorded in the same format as the user-defined batch and run properties, but they cannot be overwritten by the script.

3. Paths to input and output files. These are fully qualified paths that the script reads from or writes to. They are in a <property name> <property value> format without property types. The paths currently used are:

  • a. runDataUploadedFile: the raw data file that was selected by the user and uploaded to the server as part of an import process. This can be an Excel file, a tab-separated text file, or a comma-separated text file.
  • b. runDataFile: the imported data file after the assay framework has attempted to convert the file to .tsv format and match its columns to the assay data result set definition. The path will point to a subfolder below the script file directory, with a value similar to:
C:\lktrunk\build\deploy\files\transforms\@files\scripts\TransformAndValidationFiles\AssayId_22\42\runDataFile.tsv
The AssayId_22\42 part of the directory path serves to separate the temporary files from multiple executions by multiple scripts in the same folder.
  • c. AssayRunTSVData: This file path is where the result of the transform script will be written. It will point to a unique file name in an “assaydata” directory that the framework creates at the root of the files tree. NOTE: this property is written on the same line as the runDataFile property.
  • d. errorsFile: the path where a transform or validation script can write out error messages for use in troubleshooting. This is not normally needed by an R script because R scripts usually write errors to stdout, which the framework captures in a file named “<scriptname>.Rout”.
  • e. transformedRunPropertiesFile: This path is where the script writes out the updated values of batch- and run-level properties that are listed in the runProperties file.
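As a minimal sketch of working with this file (the column names val1 through val3 are arbitrary, gDarkStdDev is the example property shown above, and the full getRunPropsList() function later in this topic does the same work more completely), a script can read the properties into a data frame, pull values out by name, and write the transformed properties back out:

## read runProperties.tsv into a data frame; the file has no header row
rpPath <- "${runInfo}"
rp <- read.table(rpPath, col.names=c("name", "val1", "val2", "val3"), header=FALSE,
    sep="\t", quote="", fill=TRUE, stringsAsFactors=FALSE, na.strings="")

## pull out the input and output file paths
## (the AssayRunTSVData output path shares a line with the runDataFile property)
inputPath <- rp$val1[rp$name == "runDataUploadedFile"]
outputPath <- rp$val3[rp$name == "runDataFile"]

## update a run property and write the transformed properties where the framework expects them
rp$val1[rp$name == "gDarkStdDev"] <- "2.0"
write.table(rp, file=rp$val1[rp$name == "transformedRunPropertiesFile"],
    sep="\t", quote=FALSE, na="", row.names=FALSE, col.names=FALSE)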

Choosing the Input File for Transform Script Processing

The transform script developer can choose to use either the runDataFile or the runDataUploadedFile as its input. The runDataFile would be the right choice for an Excel-format raw file and a script that fills in additional columns of the data set. By using the runDataFile, the assay framework does the Excel-to-TSV conversion and the script doesn’t need to know how to parse Excel files. The runDataUploadedFile would be the right choice for a raw file in TSV format that the script is going to reformat by turning columns into rows. In either case, the script writes its output to the AssayRunTSVData file.
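For example, a script that only adds a column could read the framework-converted file and write its output as follows (a minimal sketch assuming the params list built by getRunPropsList() later in this topic; passedQC is a hypothetical column):

## read the TSV that the framework produced from the uploaded file
runData <- read.table(file=params$inputPathValidated, header=TRUE, sep="\t",
    quote="\"", fill=TRUE, stringsAsFactors=FALSE, check.names=FALSE)

## add a hypothetical derived column
runData$passedQC <- !is.na(runData[[1]])

## write the transformed result to the AssayRunTSVData location
write.table(runData, file=params$outputPath, sep="\t", quote=FALSE, na="",
    row.names=FALSE, col.names=TRUE)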

Transform Script Options

There are two useful options presented as checkboxes in the Assay designer.

  • Save Script Data tells the framework not to delete intermediate files, such as the runProperties file, after a successful run. This option is important during script development; it can be turned off later to avoid cluttering the file space under the TransformAndValidationFiles directory that the framework automatically creates under the script file directory.
  • Upload In Background tells the framework to create a pipeline job as part of the import process, rather than tying up the browser session. It is useful for importing large data sets.

Connecting Back to the Server from a Transform Script

Sometimes a transform script needs to connect back to the server to do its job. One example is translating lookup display values into key values. The Rlabkey package, available on CRAN, has the functions needed to connect to, query, and insert or update data in the LabKey Server where the script is running. To give the connection the right security context (the current user’s), the assay framework provides the substitution token ${rLabkeySessionId}. Including this token on a line by itself near the beginning of the transform script eliminates the need for a config file holding a username and password for this loopback connection. The token is replaced with two lines that look like:

labkey.sessionCookieName = "JSESSIONID"
labkey.sessionCookieContents = "TOMCAT_SESSION_ID"

where TOMCAT_SESSION_ID is the actual ID of the user's HTTP session.
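For example, once the token has been substituted, the script can query the server in the current user's security context using Rlabkey (a minimal sketch; the folder path and list name below are illustrative):

library(Rlabkey)

## the next line is replaced by the two labkey.sessionCookie assignments shown above
${rLabkeySessionId}

## query a list in the assay's folder as the importing user (names are illustrative)
df <- labkey.selectRows(baseUrl="http://localhost:8080/labkey",
    folderPath="/MyAssayFolderName", schemaName="lists", queryName="probesources")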

Debugging an R Transform Script

You can load an R transform script into the R console/debugger and run it with debug(<function name>) commands active. Since the substitution tokens described above (${srcDirectory}, ${runInfo}, and ${rLabkeySessionId}) are necessary for the script to run correctly, the framework conveniently writes out a version of the script with these substitutions already made into the same subdirectory where the runProperties.tsv file is found. Load this modified version of the script into the R console.
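A hypothetical console session might look like the following (the paths follow the examples earlier in this topic and will differ on your system; it assumes Save Script Data was enabled so the files from a previous run are still on disk, and that getRunPropsList() is defined in ExampleUtils.R as shown below):

## define the helper functions, then flag the one to be stepped through
source("C:/lktrunk/build/deploy/files/transforms/@files/scripts/ExampleUtils.R")
debug(getRunPropsList)

## call the function directly against the saved runProperties.tsv from a previous run;
## the debugger's browser prompt opens inside getRunPropsList()
rpPath <- "C:/lktrunk/build/deploy/files/transforms/@files/scripts/TransformAndValidationFiles/AssayId_22/42/runProperties.tsv"
params <- getRunPropsList(rpPath, baseUrl="http://localhost:8080/labkey")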

Example Script

Setup

  • Create a new project, type Assay
  • Add the following Web parts:
    • Files
    • Lists
    • Data Pipeline
    • Sample Sets (narrow)
  • Copy the scripts folder from the data folder to the root of the Files web part tree
  • Create a sample set called ExampleSamples
    • Click on header of Sample Sets web part
    • Select Import Sample Set
    • Open the file samples.txt in a text editor or Excel (Click to download from page.)
    • Copy and paste the contents into the import window, and select sampleId as the key field
  • Create a list called probesources by importing ProbeSourcesListArchive.zip (Click to download.)
  • Create a GPAT assay with transform script
  • Run the assay
    • Click on assay name
    • Import data button on toolbar
    • Select a probe source from the list, leave the property prefix as is, and press Next
    • Column Names Row: 65
    • Sample Set: ExampleSamples
    • Run Data: Upload a data file. Choose file GSE11199_series_matrix_200.txt (Click to download.)
    • Save and finish

A Look at the Code

This transform script example handles the data output from an Affymetrix microarray reader. The data file contains 64 lines of metadata before the chip-level intensity data. The metadata describes the platform, the experiment, and the samples used. The spot-level data is organized with one column per sample, which may be efficient for storage in a spreadsheet but isn’t good for querying in a database.

The transform script does the following tasks:

  1. Reads in the runProperties file
  2. Gets additional import processing parameters from a lookup list, such as the prefix that designates a comment line containing a property-value pair
  3. Fills in run properties that are read from the data file header (marked by the prefix). Writes the transformed run properties to the designated file location so they get stored with the assay.
  4. Converts sample identifiers to sample set key values so that a lookup from result data to sample set properties works.
  5. Skips over a specified number of rows to the beginning of the spot data.
  6. Reshapes the input data so that the result set is easier to query by sample
The areas of the code that do these things are marked with the corresponding number.

TransformScriptExample.R

options(stringsAsFactors = FALSE) 
source("${srcDirectory}/ExampleUtils.R")
baseUrl<-"http://localhost:8080/labkey"

${rLabkeySessionId}
rpPath<- "${runInfo}"

## read the file paths etc out of the runProperties.tsv file
params <- getRunPropsList(rpPath, baseUrl)

## read the input data frame just to get the column headers.
inputDF<-read.table(file=params$inputPathUploadedFile, header = TRUE,
sep = "\t", quote = "\"",
fill=TRUE, stringsAsFactors = FALSE, check.names=FALSE,
row.names=NULL, skip=(params$loaderColNamesRow -1), nrows=1)
cols<-colnames(inputDF)

## create a Name to RowId map for samples
keywords <- as.vector(colnames(inputDF)[-1])
queryName=params$sampleSetName

keywordMap<- getLookupMap( keywords, baseUrl=baseUrl, folderPath=params$containerPath,
schemaName="Samples", queryName=queryName, keyField="rowId",
displayField="SampleId")

doRunLoad(params=params, inputColNames=cols, outputColNames=c("ID_REF", "sample", "val"),
lookupMap=keywordMap)

ExampleUtils.R, function getRunPropsList()

getRunPropsList<- function(rpPath, baseUrl) 
{
rpIn<- read.table(rpPath, col.names=c("name", "val1", "val2", "val3"), #########
header=FALSE, check.names=FALSE, ## 1 ##
stringsAsFactors=FALSE, sep="\t", quote="", fill=TRUE, na.strings=""); #########

## pull out the run properties

params<- list(inputPathUploadedFile = rpIn$val1[rpIn$name=="runDataUploadedFile"],
inputPathValidated = rpIn$val1[rpIn$name=="runDataFile"],

##a little strange. AssayRunTSVData is the one we need to output to
outputPath = rpIn$val3[rpIn$name=="runDataFile"],

containerPath = rpIn$val1[rpIn$name=="containerPath"],
runPropsOutputPath = rpIn$val1[rpIn$name=="transformedRunPropertiesFile"],
sampleSetId = as.integer(rpIn$val1[rpIn$name=="sampleSet"]),
probeSourceId = as.integer(rpIn$val1[rpIn$name=="probeSource"]),
errorsFile = rpIn$val1[rpIn$name=="errorsFile"])

## lookup the name of the sample set based on its number
if (length(params$sampleSetId)>0)
{
df<-labkey.selectRows(baseUrl=baseUrl,
folderPath=params$containerPath, schemaName="exp", queryName="SampleSets",
colFilter=makeFilter(c("rowid", "EQUALS", params$sampleSetId)))
params<- c(params, list(sampleSetName=df$Name))
}

## This script reformats the rows in batches of 1000 in order to reduce
## the memory requirements of the R calculations
params<-c(params, list(loaderBatchSize=as.integer(1000)))

## From the probesource lookup table, get the prefix characters that
## identify property value comment lines in the data file, and the starting
## line number of the spot data table within the data file
dfProbeSource=labkey.selectRows(baseUrl=baseUrl, folderPath=params$containerPath, #########
schemaName="lists", queryName="probesources", ## 2 ##
colFilter=makeFilter(c("probesourceid", "EQUALS", params$probeSourceId))) #########

params<-c(params, list(propertyPrefix=dfProbeSource$propertyPrefix,
loaderColNamesRow=dfProbeSource$loaderColNamesRow))

if (is.null(params$loaderColNamesRow) || is.na(params$loaderColNamesRow))
{
params$loaderColNamesRow <- 1
}

## now apply the run property values reported in the header
## of the data tsv file to the corresponding run properties
conInput = file(params$inputPathUploadedFile, "r")

line<-""
pfx <- as.integer(0)
fHasProps <- as.logical(FALSE)

if (!is.na(params$propertyPrefix))
{ #########
pfx<-nchar(params$propertyPrefix) ## 3 ##
} #########

while(pfx>0)
{
line<-readLines(conInput, 1)
if (nchar(line)<=pfx) {break}
if (substring(line, 1, pfx) != params$propertyPrefix) {break}
strArray=strsplit(substring(line, pfx+1, nchar(line)) ,"\t", fixed=TRUE)
prop<- strArray[[1]][1]
val<- strArray[[1]][2]
if (length(rpIn$name[rpIn$name==prop]) > 0 )
{
## Dealing with dates is sometimes tricky. You want the value pushed to rpIn
## to be a string representing a date, but in the default date format. This data
## file uses a non-default date format that we explicitly convert to a Date using
## as.Date and a format string, and then convert back to character using the
## default format.

if (rpIn$val2[rpIn$name==prop]=="java.util.Date")
{
val<-as.character(as.Date(val, "%b%d%y"))
}
rpIn$val1[rpIn$name==prop]<-val
fHasProps <- TRUE
}
}

if (fHasProps)
{
## write out the transformed run properties to the file that
## the assay framework will read in
write.table(rpIn, file=params$runPropsOutputPath, sep="\t", quote=FALSE
, na="" , row.names=FALSE, col.names=FALSE, append=FALSE)
}
return (params)

}

getLookupMap()

getLookupMap<- function(uniqueLookupValues, baseUrl, folderPath, schemaName, 
queryName, keyField, displayField, otherColName=NULL, otherColValue=NULL)
{
inClauseVals = paste(uniqueLookupValues, collapse=";") #########
colfilt<-makeFilter(c(displayField, "EQUALS_ONE_OF", inClauseVals)) ## 4 ##
if (!is.null(otherColName)) #########
{
otherFilter=makeFilter(c(otherColName, "EQUALS", otherColValue))
colfilt = c(colfilt, otherFilter)
}
colsel<- paste(keyField, displayField, sep=",")

lookupMap <-labkey.selectRows(baseUrl=baseUrl, folderPath=folderPath,
schemaName=schemaName, queryName=queryName,
colSelect=colsel, colFilter=colfilt, showHidden=TRUE)

newLookups<- uniqueLookupValues[!(uniqueLookupValues %in% lookupMap[,2])]

if (length(newLookups)>0 && !is.na(newLookups[1]) )
{
## insert the lookup values that we haven't already seen before
newLookupsToInsert<- data.frame(lookupValue=newLookups, stringsAsFactors=FALSE)
colnames(newLookupsToInsert)<- displayField
if (!is.null(otherColName))
{
newLookupsToInsert<-cbind(newLookupsToInsert, otherColValue)
colnames(newLookupsToInsert)<- c(displayField, otherColName)
}

result<- labkey.insertRows(baseUrl=baseUrl, folderPath=folderPath,
schemaName=schemaName, queryName=queryName, toInsert= newLookupsToInsert)

lookupMap <-labkey.selectRows(baseUrl=baseUrl, folderPath=folderPath,
schemaName=schemaName, queryName=queryName,
colSelect=colsel, colFilter=colfilt, showHidden=TRUE)
}
colnames(lookupMap)<- c("RowId", "Name")

return(lookupMap)
}

doRunLoad()

doRunLoad<-function(params, inputColNames, outputColNames, lookupMap)
{
folder=params$containerPath
unlink(params$outputPath)

cIn <- file(params$inputPathUploadedFile, "r")
cOut<- file(params$outputPath , "w")

## write the column headers to the output file
headerDF<-data.frame(matrix(NA, nrow=0, ncol=length(outputColNames)))
colnames(headerDF)<- outputColNames

write.table(headerDF, file=cOut, sep="\t", quote=FALSE, row.names=FALSE, na="",
col.names=TRUE, append=FALSE)

# the first read from the input file skips rows up to and including the header
skipCnt<-params$loaderColNamesRow

## read in chunks of batchSize, which are then transposed and written to the output file. #########
## blkStart is the 1-based index of the starting row of a chunk ## 5 ##
#########
blkStart <- skipCnt + 1
rowsToRead <- params$loaderBatchSize

while(rowsToRead > 0)
{
inputDF <- read.table(file=cIn, header = FALSE, sep = "\t", quote = "\"",
na.strings = "---", fill=TRUE, row.names=NULL,
stringsAsFactors = FALSE, check.names=FALSE,
col.names=inputColNames, skip=skipCnt, nrows=rowsToRead)

cols<-colnames(inputDF)

if(NROW(inputDF) >0)
{
idVarName<-inputColNames[1]
df1 <- reshape(inputDF, direction="long", idvar=idVarName,
v.names="Val", timevar="Name",
times=cols[-1], varying=list(cols[-1]) ) #########
## 6 ##
df2<- merge(df1, lookupMap) #########
reshapedRows<- data.frame(cbind(df2[,idVarName], df2[,"RowId"],
df2[,"Val"], params$probeSourceId ), stringsAsFactors=FALSE)

reshapedRows[,2] <- as.integer(reshapedRows[,2])
reshapedRows[,4] <- as.integer(reshapedRows[,4])

nonEmptyRows<- !is.na(reshapedRows[,3])
reshapedRows<-reshapedRows[nonEmptyRows ,]

reshapedRows<- reshapedRows[ do.call(order, reshapedRows[1:2]), ]
colnames(reshapedRows)<- outputColNames

## need to double up the single quotes in the data
reshapedRows[,3]<-gsub("'", "''", reshapedRows[,3], fixed=TRUE)

write.table(reshapedRows, file=cOut, sep="\t", quote=TRUE, na="",
row.names=FALSE, col.names=FALSE, append=TRUE)

df1<-NULL
df2<-NULL
reshapedRows<-NULL
recordsToInsert<-NULL

}

if (NROW(inputDF)< rowsToRead)
{
##we've hit the end of the file, no more to read
rowsToRead <- 0
}
else
{
## now look where the next block will start, and read up to the end row
blkStart <- blkStart + rowsToRead
}
## skip rows only on the first read
skipCnt<-0
}
inputDF<-NULL
close(cIn)
close(cOut)
}

