Preparing Data for Import

Documentation
This topic explains how to best prepare your data for import into each of the data structures in order to meet any constraints set up by the target structure.

LabKey Server provides a variety of different data structures for different uses: Assay Designs for capturing instrument data, Datasets for integrating heterogeneous clinical data, Lists for general tabular data, etc. Some of these data structures place strong constraints on the nature of the data to imported, for example Datasets make numerous uniqueness constraints on the data; other data structures, such as Sample Sets, make few assumptions about incoming data.

General Advice: Avoid Mixed Data Types in a Column

LabKey tables (Lists, Datasets, etc.) are implemented as database tables. So your data should be prepared for insertion into a database. Most importantly, each column should conform to a database data type, such as String, Integer, Decimal Number, etc. Mixed data in a column will be rejected when you try to upload it.

Wrong

The following table mixes Boolean and String data in a single column.

ParticipantIdPreexisting Condition
P-100True, Edema
P-200False
P-300True, Anemia

Right

Split out the mixed data into separate columns

ParticipantIdPreexisting ConditionCondition Name
P-100TrueEdema
P-200False 
P-300TrueAnemia

General Advice: Avoid Special Characters in Column Headers

Column names should avoid special characters such as !, @, #, $, etc. Column names should contain only letters, numbers, spaces, and underscores; and should begin only with a letter or underscore. We also recommend underscores instead of spaces.

Wrong

The following table has special characters in the column names.

Participant #Preexisting Condition?
P-100True
P-200False
P-300True

Right

The following table removes the special characters and replaces spaces with underscores.

Participant_NumberPreexisting_Condition
P-100True
P-200False
P-300True

Clinical Datasets

Datasets are intended to capture measurements events on some subject, like a blood pressure measurement or a viral count at some point it time. So datasets are required to have two columns:

  • a subject id
  • a time point (either in the form of a date or a number)
Also, a subject cannot have two different blood pressure readings at a given point in time, so datasets reflect this fact by having uniqueness constraints: each record in a dataset must have a unique combination of subject id plus a time point.

Wrong

The following dataset has duplicate subject id / timepoint combinations.

ParticipantIdDateSystolicBloodPressure
P-1001/1/2000120
P-1001/1/2000105
P-1002/2/2000110
P-2001/1/200090
P-2002/2/200095

Right

The following table removes the duplicate row.

ParticipantIdDateSystolicBloodPressure
P-1001/1/2000120
P-1002/2/2000110
P-2001/1/200090
P-2002/2/200095

Demographic Datasets

Demographic datasets have all of the constraints of clinical datasets, plus one more: a given subject identifier cannot appear twice in a demographic dataset.

Wrong

The following demographic dataset has a duplicate subject id.

ParticipantIdDateGender
P-1001/1/2000M
P-1001/1/2000M
P-2001/1/2000F
P-3001/1/2000F
P-4001/1/2000M

Right

The following table removes the duplicate row.

ParticipantIdDateGender
P-1001/1/2000M
P-2001/1/2000F
P-3001/1/2000F
P-4001/1/2000M

Specimens

When importing data into the study module's specimen repository, three fields are required.

NameData TypeDescription
global_unique_specimen_idStringThe unique ID of each vial.
ptidStringThe ID of the subject providing each specimen.
visit_value or DateNumber(Double) or DateTimeThe visit number or date associated with the sample.

Right

An example specimen table.

global_unique_specimen_idvisit_valueptidvolumevolume_units
6396321P-100100mL
4393032P-100100mL
5524313P-100100mL

Related Topics

Discussion

previousnext
 
expand all collapse all