Create and Populate Datasets

A dataset contains related data values that are collected or measured as part of a cohort study to track participants over time. For example, laboratory tests run at a series of appointments would yield many rows per participant, but only one row for each participant at each timepoint.

A dataset's properties include identifiers, keys, and categorizations for the dataset. Its fields represent columns and establish the shape and content of the data "table". For example, a dataset for physical exams would typically include fields like height, weight, respiration rate, blood pressure, etc.

The set of fields ensures the upload of consistent data records by defining the acceptable types, and can also include validation and conditional formatting when necessary. There are system fields built into any dataset, such as creation date, and because datasets are part of studies, they must also include columns that map to participants and time.

This topic covers creating a dataset from within the study UI. You must have at least the folder administrator role to create a new dataset.

Create Dataset
Define Properties

Basic Properties
Data Row Uniqueness

Define Fields

Infer Fields from a File
Set Column Mapping
Import Data with Inferral

Export/Import Field Definitions
Manually Define Fields

Create Dataset

Navigate to the Manage tab of your study folder.
Click Manage Datasets.
Click Create New Dataset.

The following sections describe each panel within the creation wizard. If you later edit the dataset, you will return to these panels and be able to change most of the values.

Define Properties

The first panel defines Dataset Properties.
Enter the Basic Properties.
Data Row Uniqueness: Select how this dataset is keyed.
Click Advanced Settings to control whether to show the dataset in the overview, manually set the dataset ID, associate this data with cohorts, and use tags as another way to categorize datasets. Learn more in this topic: Dataset Properties.

Continue to define Fields for your dataset before clicking Save.

Basic Properties

Name (Required): The dataset name is required and must be unique.

The name must start with a letter or number character, and cannot contain special characters or some reserved substrings listed here.
The dataset name also cannot match the name of any internal table, including system tables such as the one created to hold all the study subjects, which is named after the study's "Subject Noun Singular". For example, if your subject noun is "Pig", you cannot have a dataset named "Pig".
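
The naming rules above can be sketched as a small pre-check run before you open the wizard. This is an illustration only, not LabKey's own validation: the function name is hypothetical, "special characters" is assumed to mean anything other than letters, digits, spaces, and underscores, comparisons are assumed to be case-insensitive, and the reserved substrings are passed in by the caller because the list is not reproduced in this topic.

```python
import re

# Assumption: only letters, digits, spaces, and underscores count as
# "non-special". The server's actual rule may be stricter or looser.
_ALLOWED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9 _]*$")


def check_dataset_name(name, existing_names, subject_noun, reserved_substrings=()):
    """Return a list of problems found with a proposed dataset name."""
    if not name:
        return ["name is required"]
    problems = []
    if not _ALLOWED.match(name):
        problems.append("must start with a letter or number and avoid special characters")
    lowered = name.lower()
    # Assumption: uniqueness and subject-noun checks ignore case.
    if lowered in {n.lower() for n in existing_names}:
        problems.append("must be unique")
    if lowered == subject_noun.lower():
        problems.append("cannot match the subject noun table")
    for sub in reserved_substrings:
        if sub.lower() in lowered:
            problems.append(f"contains reserved substring {sub!r}")
    return problems


print(check_dataset_name("Physical Exam", ["Demographics"], "Pig"))
print(check_dataset_name("Pig", ["Demographics"], "Pig"))
```

An empty list means the name passed every check in this sketch; the server remains the final authority.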

Label: By default, the dataset Name is shown to users. You can define a Label to use instead if desired.
Description: An optional short description of the dataset.
Category: Assigning a category to the dataset will group it with other data in that category when viewed in the data browser. By default, new datasets are uncategorized. Learn more about categories in this topic: Manage Categories.

The dropdown menu for this field lists the currently defined categories. Select one, OR
Type a new category name to define it from here, then click the Create option... entry that appears in the dropdown menu to create and select it.

Data Row Uniqueness

Select how unique data rows in your dataset are determined:

Participants only (demographic data):

There is one row per participant.
A value for the participant field is always required.

Participants and timepoints/visits:

There is (at most) one row per participant at each timepoint or visit.
Both participant and date (or visit) values must be provided.

Participants, timepoints, and additional key field:

There may be multiple rows for a participant/time combination, requiring an additional key field to ensure unique rows.
Learn more in this topic: Dataset Properties
Note that when using an additional key, you will temporarily see an error in the UI until you create the necessary field to select as a key in the next section.
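
To see which uniqueness setting fits your data before creating the dataset, you can count how often each key combination occurs. The following is a minimal sketch in plain Python; the function name and the "Test" and "Result" columns are made up for illustration, and it does not call any LabKey API.

```python
from collections import Counter


def find_duplicate_keys(rows, key_fields):
    """Return the key tuples that appear in more than one row."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


rows = [
    {"ParticipantID": "P1", "Date": "2024-01-05", "Test": "CD4", "Result": 520},
    {"ParticipantID": "P1", "Date": "2024-01-05", "Test": "Viral load", "Result": 40},
    {"ParticipantID": "P2", "Date": "2024-01-05", "Test": "CD4", "Result": 610},
]

# Participants only: P1 appears twice, so this is not demographic data.
print(find_duplicate_keys(rows, ["ParticipantID"]))
# Participants and timepoints: P1 still has two rows on the same date.
print(find_duplicate_keys(rows, ["ParticipantID", "Date"]))
# With "Test" as the additional key field, every row is unique.
print(find_duplicate_keys(rows, ["ParticipantID", "Date", "Test"]))
```

An empty result for a given set of key fields suggests that the corresponding uniqueness option would accept the data.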

Define Fields

Click the Fields panel to open it. You can define fields for a new dataset in several ways:

LabKey can infer the fields from an example data spreadsheet you upload
You can import a set of field definitions in a JSON file
You can define them manually

Infer Fields from a File

The Fields panel opens on the Import or infer fields from file option. You can click within the box to select a file or simply drag it from your desktop.

Supported data file formats include: .csv, .tsv, .txt, .xls, .xlsx.

LabKey will make a best-guess effort to infer the names and types of all the columns in the spreadsheet.

You will now see them in the fields editor that you would use to manually define fields as described below.

Note that if your file includes columns for reserved fields, they will not be shown as inferred. Reserved fields will always be created for you.
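
To get a feel for why inferred types need review, here is a rough sketch of type guessing from sample values. It is not LabKey's inference logic: the functions are hypothetical, the type names are illustrative, and dates are not recognized at all (they fall through to "Text").

```python
import csv
import io


def infer_type(values):
    """Guess a type name from the non-empty string values of one column."""
    present = [v for v in values if v.strip()]
    if not present:
        return "Text"
    # Try the narrowest type first; fall back when any value fails to parse.
    for type_name, parse in (("Integer", int), ("Decimal", float)):
        try:
            for v in present:
                parse(v)
            return type_name
        except ValueError:
            continue
    return "Text"


def infer_fields(text):
    """Map each column header in CSV text to a guessed type name."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return {
        name: infer_type([(row[name] or "") for row in rows])
        for name in (reader.fieldnames or [])
    }


sample = (
    "ParticipantID,Weight,Temperature,Notes\n"
    "P1,70,36.6,ok\n"
    "P2,82,37.1,\n"
)
print(infer_fields(sample))
```

In this sample, Weight happens to contain only whole numbers, so the sketch guesses "Integer" even though later uploads may include fractional weights. A guess based on a sample file can only reflect the values in that file.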

Make any adjustments needed.

For instance, if a numeric column happens to contain integer values, but should be of type "Decimal", make the change here.
If any field names include special characters (including spaces) you should adjust the inferral to give the field a more 'basic' name and move the original name to the Label and Import Aliases field properties. For example, if your data includes a field named "CD4+ (cells/mm3)", you would put that string in both Label and Import Aliases but name the field "CD4" for best results.
If you want one of the inferred fields to be ignored, delete it by clicking its delete icon.
If any fields should be required, check the box in the Required column.
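
The renaming advice above can be expressed as a small helper that keeps the original column header as both label and import alias. This is a sketch under stated assumptions: the function and dictionary keys are hypothetical (they are not LabKey's field-definition format), and a "basic" name is assumed to mean letters, digits, and underscores, not starting with a digit.

```python
import re

# Assumption: a "basic" name uses only letters, digits, and underscores.
_BASIC_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def field_definition(original_name, basic_name=None):
    """Build a field record, keeping the original header as label and alias."""
    if _BASIC_NAME.match(original_name):
        # Already basic: no label or alias is needed.
        return {"name": original_name, "label": None, "import_aliases": []}
    if basic_name is None or not _BASIC_NAME.match(basic_name):
        raise ValueError(f"{original_name!r} needs a basic name")
    return {
        "name": basic_name,
        "label": original_name,
        "import_aliases": [original_name],
    }


print(field_definition("CD4+ (cells/mm3)", "CD4"))
print(field_definition("Weight"))
```

The basic name is supplied by the caller on purpose: choosing "CD4" for "CD4+ (cells/mm3)" is a judgment call that an automatic cleanup would not make reliably.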

Before you click Save you have the option to import the data from the spreadsheet you used for inferral to the dataset you are creating. Otherwise, you will create an empty structure and can import data later.

Set Column Mapping

When you infer fields, you will need to confirm that the Column mapping section below the fields is correct.

All datasets must map to study subjects (participants) and non-demographic datasets must map to some sense of time (either dates or sequence numbers for visits). During field inferral, the server will make a guess at these mappings. Use the dropdowns to make changes if needed.
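
The mapping requirement can be checked against a file's header row before upload. The sketch below is illustrative only: the function and the column names are hypothetical, and it does not reproduce how the server guesses mappings.

```python
def check_column_mapping(header, participant_col, time_col=None, demographic=False):
    """Return a list of problems with a proposed column mapping."""
    problems = []
    if participant_col not in header:
        problems.append(f"participant column {participant_col!r} is not in the file")
    # Only non-demographic datasets need a date or visit column.
    if not demographic:
        if time_col is None:
            problems.append("a date or visit column is required")
        elif time_col not in header:
            problems.append(f"time column {time_col!r} is not in the file")
    return problems


header = ["PatientCode", "VisitDate", "Weight"]
print(check_column_mapping(header, "PatientCode", "VisitDate"))
print(check_column_mapping(header, "ParticipantID", "VisitDate"))
print(check_column_mapping(header, "PatientCode", demographic=True))
```

An empty list means both required mappings point at columns that exist in the file.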

Import Data with Inferral

Near the bottom, you can use the selector to control whether to import data or not. By default, it is set to Import Data and you will see the first three rows of the file. Click Save to create and populate the dataset.

If you want to create the dataset without importing data, click either the remove icon or the selector itself. The file name and preview will disappear and the selector will read Don't Import. Click Save to create the empty dataset.

Adding data to a dataset is covered in the topic: Import Data to a Dataset.

Export/Import Field Definitions

In the top bar of the list of fields, you will see an Export button. Click it to export the field definitions as a JSON file. This file can be used to create the same field definitions in another dataset, either as is or with changes made offline.

To import a JSON file of field definitions, use the infer from file method, selecting the .fields.json file instead of a data-bearing file. Note that importing or inferring fields will overwrite any existing fields; it is intended only for new dataset creation. After importing a set of fields, check the column mapping as if you had inferred fields from data.
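
Offline changes to an exported file can be scripted. The sketch below assumes, purely for illustration, that the file's top level is a list of objects, each with a "name" key. Inspect a real export before relying on this, since actual files carry more properties and may be structured differently. The function name and sample content are hypothetical.

```python
import json


def rename_field(fields, old_name, new_name):
    """Return a copy of the field list with one field renamed."""
    return [
        {**field, "name": new_name} if field.get("name") == old_name else dict(field)
        for field in fields
    ]


# Stand-in for the text of an exported .fields.json file (assumed shape).
exported = '[{"name": "Wt", "label": "Weight (kg)"}, {"name": "Pulse"}]'

edited = rename_field(json.loads(exported), "Wt", "WeightKg")
print(json.dumps(edited, indent=2))
```

All other properties are copied through unchanged, so the edited text can be saved as a new .fields.json file and selected when creating the next dataset.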

Learn more about exporting and importing sets of fields in this topic: Field Editor

Manually Define Fields

Instead of using a data-spreadsheet or JSON field definitions, you can click Manually Define Fields. You will also be able to use the manual field editor to adjust inferred or imported fields.

Note that the two required fields are predefined: ParticipantID and Date (or Visit/SequenceNum for visit-based studies). You cannot add these fields when defining a dataset manually; you only add the other fields in the dataset.

Click Add Field for each field you need. Use the Data Type dropdown to select the type, and click to expand field details to set properties.

If you add a field by mistake, click its delete icon to remove it.

After adding all your fields, click Save. You will now have an empty dataset and can import data to it.
