You can import several datasets simultaneously by composing a
dataset archive. Using an archive with the
LabKey data pipeline gives administrators flexibility in loading datasets from various locations. The archive format also includes a set of properties that control how the import is performed.
Configure Pipeline
To define the location from which the pipeline will process files, follow the instructions in this topic:
Set a Pipeline Override. You may use the standard pipeline root, or set a pipeline override to load files from a location of your choosing.
Create a Pipeline Configuration File
A
pipeline configuration file controls the operation of the pipeline job. For dataset archives, the configuration file is named with the
.dataset extension and contains a set of property/value pairs.
The configuration file specifies how the data should be handled on import. For example, you can indicate whether existing data should be replaced, deleted, or appended to. You can also specify how to map data files to datasets using file names or a file pattern. The pipeline then handles importing the data into the appropriate dataset(s).
Note that we automatically alias the names ptid, visit, dfcreate, and dfmodify to participantid, sequencenum, created, and modified respectively.
Dataset Archive File Format
Each line of a dataset archive contains one property-value pair, where the string to the left of the '=' is the property and the string to the right is the value. The first part of the property name is the id of the dataset to import. In our example the dataset id shown is '1'. The dataset id is always an integer.
The remainder of the property name is used to configure some aspect of the import operation. Each valid property is described in the
following section.
The following example shows a simple .dataset file:
1.action=REPLACE
1.deleteAfterImport=FALSE
# map a source tsv column (right side) to a property name or full propertyURI (left)
1.property.ParticipantId=ptid
1.property.SiteId=siteid
1.property.VisitId=visit
1.property.Created=dfcreate
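To make the file format concrete, here is a minimal, hypothetical sketch (not LabKey's actual parser) of how such property/value pairs could be read into per-dataset settings:

```python
def parse_dataset_config(text):
    """Parse .dataset-style property/value lines into a dict keyed by
    dataset id or the 'default'/'participant' keywords.

    Illustrative sketch only, not LabKey code."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # The first dotted segment is the dataset id (always an integer)
        # or a keyword such as 'default' or 'participant'.
        dataset_id, _, prop = key.strip().partition(".")
        config.setdefault(dataset_id, {})[prop] = value.strip()
    return config

example = """\
1.action=REPLACE
1.deleteAfterImport=FALSE
# map a source tsv column (right side) to a property name (left)
1.property.ParticipantId=ptid
default.property.SiteId=siteid
"""

settings = parse_dataset_config(example)
# settings["1"] -> {"action": "REPLACE", "deleteAfterImport": "FALSE",
#                   "property.ParticipantId": "ptid"}
```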
In addition to defining per-dataset properties, you can use the .dataset file to configure default property settings. Use the "default" keyword in the place of the dataset id. For example:
default.property.SiteId=siteid
Also, the "participant" keyword can be used to import a tsv into the participant table using a syntax similar to the dataset syntax. For example:
participant.file=005.tsv
participant.property.SiteId=siteId
Properties
The properties and their valid values are described below.
action
This property determines what happens to existing data when the new data is imported. The valid values are REPLACE, APPEND, and DELETE. REPLACE, the default, first deletes all the existing data and then imports the new data. APPEND leaves the existing data in place and appends the new data; as always, you must be careful to avoid importing duplicate rows (action=MERGE would be helpful, but is not yet supported). DELETE deletes the existing data without importing any new data.
enrollment.action=REPLACE
deleteAfterImport
This property specifies that the source .tsv file should be deleted after the data is successfully imported. The valid values are TRUE and FALSE. The default is FALSE.
enrollment.deleteAfterImport=TRUE
file
This property specifies the name of the tsv (tab-separated values) file that contains the data for the named dataset. This property does not apply to the default dataset. In this example, the file enrollment.tsv contains the data to be imported into the enrollment dataset.
enrollment.file=enrollment.tsv
filePattern
This property applies to the default dataset only. If your dataset files are named consistently, you can use this property to specify how to match each file with the appropriate dataset. For instance, assume your data is stored in files with names like plate###.tsv, where ### corresponds to the appropriate DatasetId. In this case you could use the file pattern "plate(\d\d\d).tsv". Files are then matched against this pattern, so you do not need to configure the source file for each dataset individually. If your files have names like dataset###.tsv, where ### corresponds to the dataset name, you can use the file pattern "dataset(\w*).tsv".
default.filePattern=plate(\d\d\d).tsv
property
If the column names in the tsv data file do not match the dataset property names, the property property can be used to map columns in the .tsv file to dataset properties. This mapping works for both user-defined and built-in properties. Assume that the ParticipantId value should be loaded from the column labeled ptid in the data file. The following line specifies this mapping:
enrollment.property.ParticipantId=ptid
Note that each dataset property may be specified only once on the left side of the equals sign, and each .tsv file column may be specified only once on the right.
sitelookup
This property applies to the participant dataset only. When importing the participant dataset, the user typically will not know the LabKey internal code of each site, so one of the other unique columns from the sites must be used instead. The sitelookup property indicates which column is being used. For instance, to specify a site by name, use
participant.sitelookup=label. The possible columns are
label, rowid, ldmslabcode, labwarelabcode, and
labuploadcode. Note that internal users may use
scharpid as well, though that column name may not be supported indefinitely.
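The filePattern matching described above can be sketched with a standard regular-expression library. This is an illustration of the pattern semantics, not LabKey's implementation; note that the dot is escaped here, whereas the documented pattern leaves it unescaped:

```python
import re

# Pattern from the default.filePattern example above. The first capture
# group is read as the DatasetId (or as the dataset name for patterns
# like "dataset(\w*).tsv").
plate_pattern = re.compile(r"plate(\d\d\d)\.tsv")

for name in ["plate001.tsv", "plate042.tsv", "enrollment.tsv"]:
    match = plate_pattern.fullmatch(name)
    if match:
        dataset_id = int(match.group(1))  # e.g. plate001.tsv -> dataset 1
        print(name, "-> dataset id", dataset_id)
    else:
        print(name, "-> no match; import only if configured explicitly")
```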
Participant Dataset
The virtual participant dataset is used to import site information associated with a participant. This dataset has three columns: ParticipantId, EnrollmentSiteId, and CurrentSiteId. ParticipantId is required, while EnrollmentSiteId and CurrentSiteId are both optional.
As described above, you can use the
sitelookup property to import a value from one of the other columns in this table. If any of the imported values are ambiguous, the import will fail.
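For illustration, assuming participant.sitelookup=label is set and sites labeled "Downtown Clinic" and "University Hospital" exist (hypothetical names), a tab-separated participant file might look like:

```
ParticipantId	EnrollmentSiteId	CurrentSiteId
101	Downtown Clinic	Downtown Clinic
102	University Hospital	Downtown Clinic
```

Each site label is resolved against the sites table's label column; a label matching more than one site would make the import fail as ambiguous.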
Related Topics