Create a Pipeline Configuration File

To control the operation of the dataset import, you can create a pipeline configuration file. The configuration file for dataset import is named with the .dataset extension and contains a set of property/value pairs.

The configuration file specifies how the data should be handled on import. For example, you can indicate whether existing data should be replaced, deleted, or appended to when new data is imported into the same named dataset. You can also specify how to map data files to datasets using file names or a file pattern. The pipeline will then handle importing the data into the appropriate dataset.

Note that we automatically alias the names ptid, visit, dfcreate, and dfmodify to participantid, sequencenum, created, and modified.

File Format

The following example shows a simple .dataset file:


# map a source tsv column (right side) to a property name or full propertyURI (left)
Each line contains one property-value pair, where the string to the left of the '=' is the property and the string to the right is the value. The first part of the property name is the id of the dataset to import. In this example the dataset id is '1'. The dataset id is always an integer.

The remainder of the property name is used to configure some aspect of the import operation. Each valid property is described in the following section.

In addition to defining per-dataset properties, you can use the .dataset file to configure default property settings. Use the "default" keyword in the place of the dataset id. For example:

Also, the "participant" keyword can be used to import a tsv into the participant table using a syntax similar to the dataset syntax. For example:



The properties and their valid values are described below.


This property determines what happens to existing data when the new data is imported. The valid values are REPLACE, APPEND, DELETE. DELETE deletes the existing data without importing any new data. APPEND leaves the existing data and appends the new data. As always, you must be careful to avoid importing duplicate rows (action=MERGE would be helpful, but is not yet supported). REPLACE will first delete all the existing data before importing the new data. REPLACE is the default.



This property specifies that the source .tsv file should be deleted after the data is successfully imported. The valid values are TRUE or FALSE. The default is FALSE.



This property specifies the name of the tsv (tab-separated values) file which contains the data for the named dataset. This property does not apply to the default dataset. In this example, the file enrollment.tsv contains the data to be imported into the enrollment dataset.



This property applies to the default dataset only. If your dataset files are named consistently, you can use this property to specify how to find the appropriate dataset to match with each file. For instance, assume your data is stored in files with names like plate###.tsv, where ### corresponds to the appropriate DatasetId. In this case you could use the file pattern "plate(\d\d\d).tsv". Files will then be matched against this pattern, so you do not need to configure the source file for each dataset individually.



If the column names in the tsv data file do not match the dataset property names, the property property can be used to map columns in the .tsv file to dataset properties. This mapping works for both user-defined and built-in properties. Assume that the ParticipantId value should be loaded from the column labeled ptid in the data file. The following line specifies this mapping:

Note that each dataset property may be specified only once on the left side of the equals sign, and each .tsv file column may be specified only once on the right.


This property applies to the participant dataset only. Upon importing the particpant dataset, the user typically will not know the LabKey internal code of each site. Therefore, one of the other unique columns from the sites must be used. The sitelookup property indicates which column is being used. For instance, to specify a site by name, use participant.sitelookup=label. The possible columns are label, rowid, ldmslabcode, labwarelabcode, and labuploadcode. Note that internal users may use scharpid as well, though that column name may not be supported indefinitely.

Participant Dataset

The virtual participant dataset is used as a way to import site information associated with a participant. This dataset has three columns in it: ParticipantId, EnrollmentSiteId, and CurrentSiteId. ParticipantId is required, while EnrollmentSiteId and CurrentSiteId are both optional.

As described above, you can use the sitelookup property to import a value from one of the other columns in this table. If any of the imported value are ambiguous, the import will fail.





expand all collapse all