This topic explains how to best prepare your data for import so you can meet any requirements set up by the target data structure.
LabKey Server provides a variety of different data structures for different uses: Assay Designs for capturing instrument data, Datasets for integrating heterogeneous clinical data, Lists for general tabular data, etc. Some of these data structures place strong constraints on the nature of the data to be imported, for example Datasets make uniqueness constraints on the data; other data structures, such as Lists, make few assumptions about incoming data.
General Advice: Avoid Mixed Data Types in a Column
LabKey tables (Lists, Datasets, etc.) are implemented as
database tables. So your data should be prepared for insertion into a database. Most importantly, each column should conform to a database data type, such as String, Integer, Decimal Number, etc. Mixed data in a column will be rejected when you try to upload it.
Wrong
The following table mixes Boolean and String data in a single column.
ParticipantId | Preexisting Condition |
---|
P-100 | True, Edema |
P-200 | False |
P-300 | True, Anemia |
Right
Split out the mixed data into separate columns
ParticipantId | Preexisting Condition | Condition Name |
---|
P-100 | True | Edema |
P-200 | False | |
P-300 | True | Anemia |
General Advice: Avoid Special Characters in Column Headers
Column names should avoid special characters such as !, @, #, $, etc.
Column names should contain only letters, numbers, spaces, and underscores; and should begin only with a letter or underscore. We also recommend underscores instead of spaces.
Wrong
The following table has special characters in the column names.
Participant # | Preexisting Condition? |
---|
P-100 | True |
P-200 | False |
P-300 | True |
Right
The following table removes the special characters and replaces spaces with underscores.
Participant_Number | Preexisting_Condition |
---|
P-100 | True |
P-200 | False |
P-300 | True |
Data Aliasing
Use data aliasing to work with non-conforming data, meaning the provided data has different columns names or different value ids for the same underlying thing. Examples include:
- A lab provides assay data which uses different participant ids than those used in your study. Using different participant ids is often desirable and intentional, as it provides a layer of PHI protection for the lab and the study.
- Excel files have different column names for the same data, for example some files have the column "Immune Rating" and other have the column "Immune Score". You can define an arbitrary number of these import aliases to map to the same column in LabKey.
- The source files have a variety of names for the same visit id, for example, "M1", "Milestone #1", and "Visit 1".
Clinical Dataset Details
Datasets are intended to capture measurements events on some subject, like a blood pressure measurement or a viral count at some point it time. So datasets are required to have two columns:
- a subject id
- a time point (either in the form of a date or a number)
Also, a subject cannot have two different blood pressure readings at a given point in time, so datasets reflect this fact by having uniqueness constraints: each record in a dataset must have a unique combination of subject id plus a time point.
Wrong
The following dataset has duplicate subject id / timepoint combinations.
ParticipantId | Date | SystolicBloodPressure |
---|
P-100 | 1/1/2000 | 120 |
P-100 | 1/1/2000 | 105 |
P-100 | 2/2/2000 | 110 |
P-200 | 1/1/2000 | 90 |
P-200 | 2/2/2000 | 95 |
Right
The following table removes the duplicate row.
ParticipantId | Date | SystolicBloodPressure |
---|
P-100 | 1/1/2000 | 120 |
P-100 | 2/2/2000 | 110 |
P-200 | 1/1/2000 | 90 |
P-200 | 2/2/2000 | 95 |
Demographic Dataset Details
Demographic datasets have all of the constraints of clinical datasets, plus one more: a given subject identifier cannot appear twice in a demographic dataset.
Wrong
The following demographic dataset has a duplicate subject id.
ParticipantId | Date | Gender |
---|
P-100 | 1/1/2000 | M |
P-100 | 1/1/2000 | M |
P-200 | 1/1/2000 | F |
P-300 | 1/1/2000 | F |
P-400 | 1/1/2000 | M |
Right
The following table removes the duplicate row.
ParticipantId | Date | Gender |
---|
P-100 | 1/1/2000 | M |
P-200 | 1/1/2000 | F |
P-300 | 1/1/2000 | F |
P-400 | 1/1/2000 | M |
Specimen Details
When importing data into the study module's specimen repository, three fields are required.
Name | Data Type | Description |
---|
global_unique_specimen_id | String | The unique ID of each vial. |
ptid | String | The ID of the subject providing each specimen. |
visit_value or Date | Number(Double) or DateTime | The visit number or date associated with the sample. |
Right
An example specimen table.
global_unique_specimen_id | ptid | visit_value | volume | volume_units |
---|
639632 | P-100 | 1 | 100 | mL |
439303 | P-100 | 2 | 100 | mL |
552431 | P-100 | 3 | 100 | mL |
Related Topics