Import From a Dataset Archive

2024-03-28

You can simultaneously import several datasets by composing a dataset archive. Using an archive and the LabKey data pipeline gives administrators flexibility about loading datasets from various locations. The archive format also contains a set of properties to control how the import is performed.

Configure Pipeline

To define the location from which the pipeline will process files, follow the instructions in this topic: Set a Pipeline Override. You may use the standard pipeline root or a pipeline override allows you to load files from the location of your choosing.

Create a Pipeline Configuration File

A pipeline configuration file controls the operation of the pipeline job. For dataset archives, the configuration file is named with the .dataset extension and contains a set of property/value pairs.

The configuration file specifies how the data should be handled on import. For example, you can indicate whether existing data should be replaced, deleted, or appended-to. You can also specify how to map data files to datasets using file names or a file pattern. The pipeline will then handle importing the data into the appropriate dataset(s).

Note that we automatically alias the names ptid, visit, dfcreate, and dfmodify to participantid, sequencenum, created, and modified respectively.

Dataset Archive File Format

Each line of a dataset archive contains one property-value pair, where the string to the left of the '=' is the property and the string to the right is the value. The first part of the property name is the id of the dataset to import. In our example the dataset id shown is '1'. The dataset id is always an integer.

The remainder of the property name is used to configure some aspect of the import operation. Each valid property is described in the following section.

The following example shows a simple .dataset file:

1.action=REPLACE
1.deleteAfterImport=FALSE

# map a source tsv column (right side) to a property name or full propertyURI (left)
1.property.ParticipantId=ptid
1.property.SiteId=siteid
1.property.VisitId=visit
1.property.Created=dfcreate

In addition to defining per-dataset properties, you can use the .dataset file to configure default property settings. Use the "default" keyword in the place of the dataset id. For example:

default.property.SiteId=siteid

Also, the "participant" keyword can be used to import a tsv into the participant table using a syntax similar to the dataset syntax. For example:

participant.file=005.tsv
participant.property.SiteId=siteId

Properties

The properties and their valid values are described below.

action

This property determines what happens to existing data when the new data is imported. The valid values are REPLACE, APPEND, DELETE. DELETE deletes the existing data without importing any new data. APPEND leaves the existing data and appends the new data. As always, you must be careful to avoid importing duplicate rows (action=MERGE would be helpful, but is not yet supported). REPLACE will first delete all the existing data before importing the new data. REPLACE is the default.

enrollment.action=REPLACE

deleteAfterImport

This property specifies that the source .tsv file should be deleted after the data is successfully imported. The valid values are TRUE or FALSE. The default is FALSE.

enrollment.deleteAfterImport=TRUE

file

This property specifies the name of the tsv (tab-separated values) file which contains the data for the named dataset. This property does not apply to the default dataset. In this example, the file enrollment.tsv contains the data to be imported into the enrollment dataset.

enrollment.file=enrollment.tsv

filePattern

This property applies to the default dataset only. If your dataset files are named consistently, you can use this property to specify how to find the appropriate dataset to match with each file. For instance, assume your data is stored in files with names like plate###.tsv, where ### corresponds to the appropriate DatasetId. In this case you could use the file pattern "plate(\d\d\d).tsv". Files will then be matched against this pattern, so you do not need to configure the source file for each dataset individually. If your files are defined with names like dataset###.tsv, where ### corresponds to the dataset name, you can use the following file pattern "dataset(\w*).tsv".

default.filePattern=plate(\d\d\d).tsv

property

If the column names in the tsv data file do not match the dataset property names, the property property can be used to map columns in the .tsv file to dataset properties. This mapping works for both user-defined and built-in properties. Assume that the ParticipantId value should be loaded from the column labeled ptid in the data file. The following line specifies this mapping:

enrollment.property.ParticipantId=ptid

Note that each dataset property may be specified only once on the left side of the equals sign, and each .tsv file column may be specified only once on the right.

sitelookup

This property applies to the participant dataset only. Upon importing the particpant dataset, the user typically will not know the LabKey internal code of each site. Therefore, one of the other unique columns from the sites must be used. The sitelookup property indicates which column is being used. For instance, to specify a site by name, use participant.sitelookup=label. The possible columns are label, rowid, ldmslabcode, labwarelabcode, and labuploadcode. Note that internal users may use scharpid as well, though that column name may not be supported indefinitely.

Participant Dataset

The virtual participant dataset is used as a way to import site information associated with a participant. This dataset has three columns in it: ParticipantId, EnrollmentSiteId, and CurrentSiteId. ParticipantId is required, while EnrollmentSiteId and CurrentSiteId are both optional.

As described above, you can use the sitelookup property to import a value from one of the other columns in this table. If any of the imported value are ambiguous, the import will fail.

Related Topics