Tutorial: Inferring Datasets from Excel and TSV Files: /Documentation/Archive/19.3

Tutorial: Inferring Datasets from Excel and TSV Files

Use the following step-by-step instructions to quickly create and import study datasets from Excel and/or TSV files. This method of creating datasets is especially useful for any of the following scenarios:

You want to avoid manually defining the fields in your study datasets.
You want to define and import study datasets as an initial, one-time job, and then develop your study using the server user interface.
You want to control your study data using files, instead of adding/updating data using the server user interface.

The overall process goes as follows:

Assemble and prepare your Excel and/or TSV files to conform to dataset requirements, namely, having (1) required fields (ParticipantId and timepoint fields) and (2) conforming to dataset uniqueness constraints.
Create the directory structure and XML metadata files for study archive.
Add the Excel/TSV files to the study archive.
Invoke the 'study reload' pipeline job to create the datasets and import the data.

Detailed instructions are below:

First prepare your Excel and/or TSV files. Each file will become a separate dataset in your study. Each file must meet the following requirements:

Each file must have a column that holds the participant ids, using the same column name across all files. Any column name will work, such as "ParticipantId" or "SubjectId", but whatever name you choose, it must be the same in all your datasets.
Each file must have the same way of indicating the time point information, either using dates or numbers. If you choose dates, then each file must have a column named "Date" (with calendar dates in that column). If you choose numbers, each file must have a column named "Visit" (with integers or decimal numbers in that column).
File names with trailing numbers (for example, Demographics1.xlsx) may result in unexpected reload behavior. To control how these files are reloaded, set up a File Watcher reload trigger and use a "name capture group" in the FilePattern property. For details see File Watcher.

Create a new folder of type "study". Once you have created the study folder, click Create Study. On the Create Study page ensure that the Subject Column Name and Timepoint Style are appropriate to the files you are about to import.

Export study properties:

Once your study has been created, click the Manage tab.
Scroll down to the bottom of the page and click Export Study.
Under Options select Pipeline root export directory, as individual files.
Click Export.

You will be redirected to the Files web part.

In the Files web part, open the directories export > study > datasets
In the datasets directory, select and delete the following three files: datasets_manifest.xml, datasets_metadata.xml, and Study.dataset. (These files are deleted because when they are present the server uses them to find the Excel/TSV files to import. When these files are removed, the server will import whatever Excel/TSV files it finds in the /datasets directory.)

Drag and drop your Excel/TSV files into the Files web part, into the export/study/datasets directory (where you just deleted the three files above).

In the Files web part, navigate to the "study" directory and select the file "study.xml".

Click Import Data, confirm that Reload Study is selected, and click Import.

On the Import Study from Pipeline page, make no changes, and click Start Import.
The server will examine your files, infer column names and data types, and import the data into new datasets.
Click the Clinical and Assay Data tab to view the datasets. Note that if you change the original Excel/TSV files and reload these changes, the server will replace the dataset data; update is not supported.

Once the datasets are imported, you can define the rest of your study properties, such as cohort information, visualizations, etc.

Troubleshooting

If there is a problem with the reload process, the server will indicate an error in the pipeline job. Click the Error link to see details on the problem.

Duplicate Row

Example error message:

10 Apr 2018 09:52:48,534 ERROR: PhysicalExam.xlsx -- Only one row is allowed for each Participant/Visit. Duplicates were found in the imported data.
10 Apr 2018 09:52:48,557 ERROR: PhysicalExam.xlsx -- Duplicate: Participant = P1, VisitSequenceNum = 1.0

This error message means that the PhysicalExam.xlsx file has a duplicate row of data for participant P1 / visit 1.0. Study datasets are allowed only one row of data for each participant/visit combination. If more than one row of data appears for a given participant/visit combination, data import will fail for the entire dataset.

The following violates the dataset uniqueness constraint:

The following dataset has duplicate subject id / timepoint combinations.

ParticipantId	Visit	SystolicBloodPressure
P-100	1	120
P-100	1	105
P-100	2	110
P-100	3	90

Solution:

Edit your Excel/TSV file to either (1) remove the duplicate row of data or (2) define a new visit number to retain the row of data.

The following table removes the duplicate row.

ParticipantId	Visit	SystolicBloodPressure
P-100	1	120
P-100	2	110
P-100	3	90

Missing Visit Column

Example error message:

10 Apr 2018 12:54:33,563 ERROR: Demographics.xlsx -- Row 1 Missing value for required property: SequenceNum

This error message arises in a Visit-based study when one of the files does not have a column named 'Visit'. (The server's internal name for the 'Visit' column is 'SequenceNum' -- that's why the error message mentions 'SequenceNum'.)

Solution:

Ensure that the timepoint information is captured in a column named 'Visit'.

Missing Date Column

Example error message:

10 Apr 2018 13:00:06,547 ERROR: DemographicsBad.xlsx -- Missing required field Date

This error message arises in a date-based study when one of the files does not have a column named 'Date'.

Solution:

Ensure that the timepoint information is captured in a column named 'Date'.

LabKey Support

LabKey Support