Create and Populate Datasets: /Documentation/Archive/20.7

Create and Populate Datasets

A dataset contains related data values that are collected or measured as part of a cohort study. Datasets are keyed by both subject and time. Examples include laboratory tests or other information collected about a participant over time, where there are many rows per participant but only one row for each participant at each time.

A dataset's properties include identifiers, keys, and categorizations for the dataset. Its fields represent columns and establish the shape and content of the data "table". For example, a dataset for physical exams would typically include fields like height, weight, respiration rate, blood pressure, etc.

The set of fields ensures the upload of consistent data records by defining the acceptable types, and can also include validation and conditional formatting when necessary. System fields, such as creation date, are built into every dataset. Because datasets are part of studies, they must also include columns that map to participants and time.
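
As a rough illustration of how field definitions keep records consistent, the sketch below checks one record against a list of typed fields. It is not LabKey code: the Field class, validate_record function, and the example field list are hypothetical names chosen for this example, and the type checks are deliberately strict.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Field:
    """Hypothetical field definition: a name, a Python type, and a required flag."""
    name: str
    data_type: type
    required: bool = False


def validate_record(record, fields):
    """Return a list of problems found in one record; an empty list means valid."""
    errors = []
    for field in fields:
        value = record.get(field.name)
        if value is None:
            if field.required:
                errors.append(f"{field.name}: required value is missing")
            continue
        if not isinstance(value, field.data_type):
            errors.append(
                f"{field.name}: expected {field.data_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors


# Illustrative shape for a physical exam dataset in a date-based study.
PHYSICAL_EXAM_FIELDS = [
    Field("ParticipantID", str, required=True),
    Field("Date", date, required=True),
    Field("Height", float),
    Field("Weight", float),
]
```

A record that omits Date and supplies text for Height would produce two error messages, while a complete, correctly typed record produces none.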

This topic covers creating a dataset from within the study UI.

Create Dataset
Define Properties

Basic Properties
Data Row Uniqueness

Define Fields

Infer Fields from a File
Set Column Mapping
Import Data with Inferral

Manually Define Fields

Create Dataset

Navigate to the Manage tab of your study folder.
Click Manage Datasets.
Click Create New Dataset.

The following sections describe each panel within the creation wizard. If you later edit the dataset, you will return to these panels and be able to change most of the values.

Define Properties

The first panel defines Dataset Properties.
Enter the Basic Properties.
Data Row Uniqueness: Select the appropriate value for how this dataset is keyed.
Click Advanced Settings to control whether to show the dataset in the overview, manually set the dataset ID, associate this data with cohorts, and use tags as another way to categorize datasets. Learn more in this topic: Dataset Properties.

Continue to define Fields for your dataset before clicking Save.

Basic Properties

Name (Required): The dataset name is required and must be unique.
Label: By default, the dataset Name is shown to users. You can define a Label to use instead if desired.
Description: An optional short description of the dataset.
Category: Assigning a category to the dataset will group it with other data in that category when viewed in the data browser. By default, new datasets are uncategorized. Learn more about categories in this topic: Manage Categories.

The dropdown menu for this field lists the currently defined categories. Select one, OR
Type a new category name to define it from here. Click the Create option... entry that appears in the dropdown menu to create and select it.

Data Row Uniqueness

Select how unique data rows in your dataset are determined:

Participants only (demographic data): There is one row per participant. For example, enrollment data.
Participants and timepoints/visits: There is (at most) one row per participant at each timepoint or visit. For example, physical exams would only have one measured value for a subject at a time, but there might be many rows for that subject: one for each time the value was measured.
Participants, timepoints, and additional key field: If there may be multiple rows for a participant/time combination, such as assay data from a series of specimens taken at an appointment, you would choose this option. When selected you will also specify:

Additional Key Field: Select the field that will uniquely identify rows alongside participant and time. In the above example, it might be specimen ID. You could also use the time portion of the date time field. Note that you must first define fields as described below before you can select them from the dropdown here.
Let server manage fields to make entries unique: Check the box if you would like the server to manage field values to ensure uniqueness. Numbers will be assigned auto-incrementing integer values; strings will be assigned globally unique identifiers (GUIDs).
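
The three uniqueness options can be thought of as progressively wider key tuples. The following sketch, written under that assumption, finds rows that collide on a given key and shows one way managed key values (auto-incrementing integers or GUIDs) could make them unique. The function names and the RowKey field are hypothetical, and the actual server behavior may differ in its details.

```python
import uuid
from collections import Counter
from itertools import count


def find_duplicate_keys(rows, key_columns):
    """Return the key tuples that occur in more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def assign_managed_keys(rows, key_field, as_guid=False):
    """Return copies of the rows with a generated value in key_field.

    Integers count up from 1; with as_guid=True each row gets a random GUID.
    """
    counter = count(1)
    keyed = []
    for row in rows:
        new_row = dict(row)
        new_row[key_field] = str(uuid.uuid4()) if as_guid else next(counter)
        keyed.append(new_row)
    return keyed


# Two specimens from the same appointment collide on participant + date.
assay_rows = [
    {"ParticipantID": "P-001", "Date": "2020-07-01", "Result": 4.2},
    {"ParticipantID": "P-001", "Date": "2020-07-01", "Result": 3.9},
    {"ParticipantID": "P-002", "Date": "2020-07-01", "Result": 5.1},
]

collisions = find_duplicate_keys(assay_rows, ("ParticipantID", "Date"))
keyed_rows = assign_managed_keys(assay_rows, "RowKey")
remaining = find_duplicate_keys(keyed_rows, ("ParticipantID", "Date", "RowKey"))
```

Here collisions contains the one participant/date pair that appears twice, and remaining is empty once the additional key field is included in the key.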

Define Fields

Click the Fields panel to open it. LabKey can infer the fields from a sample spreadsheet, or you can define them manually.

Infer Fields from a File

The Fields panel opens on the Infer fields from file option. You can click within the box to select a file or simply drag it from your desktop. Supported formats include: .csv, .tsv, .txt, .xls, .xlsx.

LabKey will make a best-guess effort to infer the names and types for all the columns in the spreadsheet. The inferred fields then appear in the same fields editor that you would use to manually define fields as described below.

Make any adjustments needed.

For instance, if a numeric column happens to contain integer values, but should be of type "Decimal", make the change here.
If you want one of the inferred fields to be ignored, delete it by clicking the delete icon for that field.
If any fields should be required, check the box in the Required column.
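
To see why an inferred type may need adjusting, consider this minimal sketch of type inference over a CSV sample. It is an illustration only, not the server's inference logic: the infer_type and infer_fields functions are hypothetical, the type labels are just strings, and dates are not detected at all.

```python
import csv
import io


def infer_type(values):
    """Guess "Integer", "Decimal", or "Text" from a column's string values."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return "Text"
    # Try the narrowest type first; fall back when any value fails to parse.
    for type_name, parse in (("Integer", int), ("Decimal", float)):
        try:
            for value in non_empty:
                parse(value)
            return type_name
        except ValueError:
            continue
    return "Text"


def infer_fields(csv_text):
    """Map each column name in a CSV sample to its guessed type."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {
        name: infer_type([row[name] or "" for row in rows])
        for name in (reader.fieldnames or [])
    }


sample = (
    "ParticipantID,Date,Weight,Pulse\n"
    "P-001,2020-07-01,68,72\n"
    "P-002,2020-07-01,70,65\n"
)
inferred = infer_fields(sample)
```

Because every Weight value in this sample happens to be a whole number, the sketch guesses Integer for that column, which is exactly the kind of guess you would change to Decimal before saving.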

Before you click Save you have the option to import the data from the spreadsheet you used for inferral to the dataset you are creating. Otherwise, you will create an empty structure and can import data later.

Set Column Mapping

When you infer fields, you will need to confirm that the Column mapping section below the fields is correct.

Datasets must map to both study subjects (participants) and some sense of time (either dates or sequence numbers for visits). During field inferral, the server will make a guess at these mappings. Use the dropdowns to make changes if needed.
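
One way to picture the mapping guess is a lookup of column names against lists of likely candidates. The sketch below assumes that approach; the hint lists and function names are invented for illustration and do not describe how the server actually chooses its defaults.

```python
# Hypothetical name hints, compared after lowercasing and removing punctuation.
PARTICIPANT_HINTS = ("participantid", "ptid", "subjectid")
TIME_HINTS = ("date", "visitdate", "sequencenum", "visit")


def normalize(name):
    """Lowercase a column name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def guess_column(columns, hints):
    """Return the first column whose normalized name matches a hint, else None."""
    by_normalized = {normalize(col): col for col in columns}
    for hint in hints:
        if hint in by_normalized:
            return by_normalized[hint]
    return None


def guess_mapping(columns):
    """Guess which columns hold the participant and the time value."""
    return {
        "participant": guess_column(columns, PARTICIPANT_HINTS),
        "time": guess_column(columns, TIME_HINTS),
    }


mapping = guess_mapping(["Participant ID", "Visit Date", "Weight"])
```

A None value in the result corresponds to the case where no guess could be made and the mapping has to be chosen by hand from the dropdowns.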

Import Data with Inferral

Near the bottom, you can use the selector to control whether to import data or not. By default, it is set to Import Data and you will see the first three rows of the file. Click Save to create and populate the dataset.

If you want to create the dataset without importing data, either click the remove icon or click the selector itself. The file name and preview will disappear and the selector will read Don't Import. Click Save to create the empty dataset.

Adding data to a dataset is covered in the topic: Import Data to a Dataset.

Manually Define Fields

Instead of using a spreadsheet, you can click Manually Define Fields. Note that the two required fields are predefined: ParticipantID and Date (or SequenceNum for visit-based studies). You cannot add these fields when defining a dataset manually; you only add the other fields in the dataset.
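
The rule about predefined fields can be sketched as a small check on a list of custom field names. This is a conceptual model only: build_field_list and the timepoint_type argument are hypothetical, and the case-insensitive comparison is an assumption made for the example.

```python
# Predefined fields per study type, using the names given in this topic.
PREDEFINED_FIELDS = {
    "date": ("ParticipantID", "Date"),
    "visit": ("ParticipantID", "SequenceNum"),
}


def build_field_list(custom_fields, timepoint_type="date"):
    """Return predefined fields followed by the custom ones.

    Raises ValueError if a custom field repeats a predefined or earlier name.
    """
    predefined = PREDEFINED_FIELDS[timepoint_type]
    reserved = {name.lower() for name in predefined}
    seen = set()
    for name in custom_fields:
        key = name.lower()
        if key in reserved:
            raise ValueError(f"{name} is predefined and cannot be added")
        if key in seen:
            raise ValueError(f"{name} is defined more than once")
        seen.add(key)
    return list(predefined) + list(custom_fields)


exam_fields = build_field_list(["Height", "Weight", "Pulse"])
```

Passing a name such as Date in the custom list raises an error, mirroring the fact that only the other fields are added by hand.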

Click Add Field for each field you need. Use the Data Type dropdown to select the type, and click the expand icon to open the field details and set properties.

If you add a field by mistake, click the delete icon to remove it.

After adding all your fields, click Save. You will now have an empty dataset and can import data to it.

LabKey Support