Table of Contents

  • Import Data to a Dataset
    • Import From a Dataset Archive

Import Data to a Dataset


This topic describes how to import data to an existing dataset in a LabKey study. Authorized users can add data a row at a time, or import data from spreadsheets or other data files. Imported data must match the structure of the dataset into which it is imported.

The following import methods are available:

Insert Single Row

  • Navigate to the dataset of interest.
  • Select (Insert Data) > Insert New Row above the dataset grid.
  • Enter data for the fields.
    • Fields marked with '*' are required.
    • Hover over the '?' for any field to see more information, such as the type expected.
    • If an administrator has limited input to a set of values, you will see a dropdown menu and can type ahead to narrow the list of options.
  • When finished, click Submit.

Import from File

  • Navigate to the dataset of interest.
  • Select (Insert Data) > Import Bulk Data.
  • To confirm that the data structures match, click Download Template to obtain an empty spreadsheet containing all of the fields defined. Fill in your data, or compare it to your existing spreadsheet and adjust until the spreadsheet format matches the dataset.
  • By default you will Add rows. Any data for existing rows will cause the import to fail. See update and merge options below.
  • To import the data from the file, click the Upload file panel to open it.
    • Click Browse or Choose File to select and upload the .xlsx, .xls, .csv, or .txt file.
    • Check Import Lookups by Alternate Key if you want lookups resolved by something other than the primary key.
  • Click Submit.
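For illustration, a filled-in template for a hypothetical dataset keyed by participant and visit might look like the following. The column names and values here are invented for the example; the real template will contain the fields defined for your dataset, and the file itself is tab-separated.

```
ParticipantId   SequenceNum   Weight   SystolicBP
PT-101          1.0           70       120
PT-102          1.0           82       135
```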

Import via Copy/Paste

  • Navigate to the dataset of interest.
  • Select (Insert Data) > Import Bulk Data.
  • To confirm that the data structures match, click Download Template to obtain an empty spreadsheet containing all of the fields defined. Fill in your data, or compare it to your existing spreadsheet and adjust until the spreadsheet format matches the dataset.
  • If you want to update data for existing rows, select Update rows. To merge both updates and new rows, check the box: Allow new rows during update. Learn more below.
  • Provide the data in one of two ways:
    • Click Upload File to select and upload the .xlsx, .xls, .csv, or .txt file.
    • Copy and paste your tabular data into the Copy/paste text box, then select the Format (either tab-separated or comma-separated).
  • Check Import Lookups by Alternate Key if you want lookups resolved by something other than the primary key.
  • Click Submit.

Dataset Update and Merge

For bulk imports, you can select Update rows to update existing data rows during import. By default, any data for new rows will cause the update to fail.

Check the box to Allow new rows during update to merge both existing and new rows.

If Update rows and Allow new rows during update are both selected:

  • Data will be updated for existing row identifiers, and new rows will be created for any that do not match.
  • Row identifiers are determined by how the dataset is keyed, specified under Data Row Uniqueness in the dataset design.
    • Note that for a dataset with a system-managed additional key, no row updates will occur, as this key is present specifically to ensure all rows are unique when new data is imported.
  • For rows being updated, each column in the import file (or pasted data) will be checked. If the import contains a value, the existing field in the row will be replaced with the new value.
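The update/merge rules above can be sketched in code. This is a hypothetical illustration of the documented semantics, not LabKey Server code; it assumes a dataset keyed by ParticipantId and SequenceNum, and that only the columns present in the import replace existing values.

```python
# Illustrative sketch of "Update rows" + "Allow new rows during update"
# semantics. Rows are keyed by (ParticipantId, SequenceNum).

existing = {
    ("P1", 1.0): {"Weight": 70, "Height": 170},
    ("P2", 1.0): {"Weight": 82, "Height": 180},
}

incoming = [
    {"ParticipantId": "P1", "SequenceNum": 1.0, "Weight": 72},  # updates P1
    {"ParticipantId": "P3", "SequenceNum": 1.0, "Weight": 65},  # new row
]

def merge(existing, incoming, allow_new_rows=True):
    result = {key: dict(values) for key, values in existing.items()}
    for row in incoming:
        key = (row["ParticipantId"], row["SequenceNum"])
        values = {k: v for k, v in row.items()
                  if k not in ("ParticipantId", "SequenceNum")}
        if key in result:
            # Only columns present in the import replace existing values;
            # untouched columns (here, Height) are left as-is.
            result[key].update(values)
        elif allow_new_rows:
            result[key] = values
        else:
            # With "Update rows" alone, new row identifiers fail the import.
            raise ValueError(f"New row {key} not allowed during update")
    return result

merged = merge(existing, incoming)
print(merged[("P1", 1.0)])  # {'Weight': 72, 'Height': 170}
```

With allow_new_rows=False, the second incoming row would cause the whole import to fail, matching the default "Update rows" behavior.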

When datasets are updated, "smart diffing" identifies and imports only the changed rows, which in some cases can yield a considerable performance improvement. Only the minimal set of necessary records is updated, and the audit log records only the rows and fields that have changed, capturing both the "before" and "after" values in the detailed audit log.


Premium Resource Available

Subscribers to premium editions of LabKey Server can learn about adding a Bulk Edit button with the example code in the related premium topic.

Note that SequenceNum values cannot be bulk edited, as they are used in aligning subject and visit uniqueness for datasets.

Import From a Dataset Archive


You can simultaneously import several datasets by composing a dataset archive. Using an archive and the LabKey data pipeline gives administrators flexibility in loading datasets from various locations. The archive format also includes a set of properties that control how the import is performed.

Configure Pipeline

To define the location from which the pipeline will process files, follow the instructions in this topic: Set a Pipeline Override. You may use the standard pipeline root, or a pipeline override that allows you to load files from a location of your choosing.

Create a Pipeline Configuration File

A pipeline configuration file controls the operation of the pipeline job. For dataset archives, the configuration file is named with the .dataset extension and contains a set of property/value pairs.

The configuration file specifies how the data should be handled on import. For example, you can indicate whether existing data should be replaced, deleted, or appended-to. You can also specify how to map data files to datasets using file names or a file pattern. The pipeline will then handle importing the data into the appropriate dataset(s).

Note that we automatically alias the names ptid, visit, dfcreate, and dfmodify to participantid, sequencenum, created, and modified respectively.
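The aliasing above can be sketched as a simple header-mapping step. This is an illustrative sketch, not LabKey code; the function name and the case-folding behavior are assumptions for the example.

```python
# Sketch of the documented column aliases: ptid, visit, dfcreate, and
# dfmodify are treated as participantid, sequencenum, created, modified.

ALIASES = {
    "ptid": "participantid",
    "visit": "sequencenum",
    "dfcreate": "created",
    "dfmodify": "modified",
}

def canonical_columns(header):
    """Map a TSV header row to canonical dataset column names."""
    return [ALIASES.get(col.lower(), col.lower()) for col in header]

print(canonical_columns(["ptid", "visit", "Weight", "dfcreate"]))
# ['participantid', 'sequencenum', 'weight', 'created']
```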

Dataset Archive File Format

Each line of a dataset archive contains one property-value pair, where the string to the left of the '=' is the property and the string to the right is the value. The first part of the property name is the id of the dataset to import; in the example below, the dataset id is '1'. The dataset id is always an integer.

The remainder of the property name is used to configure some aspect of the import operation. Each valid property is described in the following section.

The following example shows a simple .dataset file:

1.action=REPLACE
1.deleteAfterImport=FALSE

# map a source tsv column (right side) to a property name or full propertyURI (left)
1.property.ParticipantId=ptid
1.property.SiteId=siteid
1.property.VisitId=visit
1.property.Created=dfcreate

In addition to defining per-dataset properties, you can use the .dataset file to configure default property settings. Use the "default" keyword in the place of the dataset id. For example:

default.property.SiteId=siteid

Also, the "participant" keyword can be used to import a tsv into the participant table using a syntax similar to the dataset syntax. For example:

participant.file=005.tsv
participant.property.SiteId=siteId

Properties

The properties and their valid values are described below.

action

This property determines what happens to existing data when the new data is imported. The valid values are REPLACE, APPEND, and DELETE. REPLACE (the default) first deletes all the existing data and then imports the new data. APPEND leaves the existing data in place and appends the new data; as always, you must be careful to avoid importing duplicate rows (action=MERGE would be helpful, but is not yet supported). DELETE deletes the existing data without importing any new data.

enrollment.action=REPLACE

deleteAfterImport

This property specifies that the source .tsv file should be deleted after the data is successfully imported. The valid values are TRUE or FALSE. The default is FALSE.

enrollment.deleteAfterImport=TRUE

file

This property specifies the name of the tsv (tab-separated values) file which contains the data for the named dataset. This property does not apply to the default dataset. In this example, the file enrollment.tsv contains the data to be imported into the enrollment dataset.

enrollment.file=enrollment.tsv

filePattern

This property applies to the default dataset only. If your dataset files are named consistently, you can use this property to specify how to find the appropriate dataset to match with each file. For instance, assume your data is stored in files with names like plate###.tsv, where ### corresponds to the appropriate DatasetId. In this case you could use the file pattern "plate(\d\d\d).tsv". Files will then be matched against this pattern, so you do not need to configure the source file for each dataset individually. If your files are defined with names like dataset###.tsv, where ### corresponds to the dataset name, you can use the following file pattern "dataset(\w*).tsv".
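The matching behavior can be sketched with a standard regular expression. This is an illustrative sketch of how a pattern like "plate(\d\d\d).tsv" pairs files with dataset ids via the capture group; note the sketch escapes the dot, which the document's pattern leaves unescaped.

```python
import re

# Sketch: the first capture group of default.filePattern identifies the
# target dataset (an integer id in this example).
pattern = re.compile(r"plate(\d\d\d)\.tsv")

def dataset_id_for(filename):
    """Return the dataset id a file maps to, or None if it doesn't match."""
    m = pattern.fullmatch(filename)
    return int(m.group(1)) if m else None

print(dataset_id_for("plate007.tsv"))   # 7
print(dataset_id_for("enrollment.tsv")) # None
```

A name-based pattern such as "dataset(\w*).tsv" works the same way, except the captured group is the dataset name rather than an integer id.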

default.filePattern=plate(\d\d\d).tsv

property

If the column names in the tsv data file do not match the dataset property names, the property property can be used to map columns in the .tsv file to dataset properties. This mapping works for both user-defined and built-in properties. Assume that the ParticipantId value should be loaded from the column labeled ptid in the data file. The following line specifies this mapping:

enrollment.property.ParticipantId=ptid

Note that each dataset property may be specified only once on the left side of the equals sign, and each .tsv file column may be specified only once on the right.

sitelookup

This property applies to the participant dataset only. When importing the participant dataset, the user typically will not know the LabKey internal code of each site, so one of the other unique columns from the sites table must be used. The sitelookup property indicates which column is being used. For instance, to specify a site by name, use participant.sitelookup=label. The possible columns are label, rowid, ldmslabcode, labwarelabcode, and labuploadcode. Note that internal users may use scharpid as well, though that column name may not be supported indefinitely.

Participant Dataset

The virtual participant dataset is used as a way to import site information associated with a participant. This dataset has three columns in it: ParticipantId, EnrollmentSiteId, and CurrentSiteId. ParticipantId is required, while EnrollmentSiteId and CurrentSiteId are both optional.

As described above, you can use the sitelookup property to import a value from one of the other columns in this table. If any of the imported values are ambiguous, the import will fail.
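For illustration, a .dataset fragment that imports participant site information by site label might look like the following (the file name and site labels are invented for the example):

```
participant.file=005.tsv
participant.sitelookup=label
```

Here 005.tsv would contain a ParticipantId column plus EnrollmentSiteId and/or CurrentSiteId columns whose values are site labels; any label that matches more than one site makes the import fail.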
