This topic answers a few questions about how study data is stored inside the server:
- When you create a study dataset and import data to it, how is that data stored and managed in the system?
- How is data from different studies related?
- Is the data from different studies physically or logically separated?
Topics
Data Tables within Schema
When you create a study dataset, LabKey creates a corresponding table in the main database to hold the data. This table is created in the studyDataset schema. For example if you create the table
LabKey will create a corresponding table with a related name (container id string + Demographics), for example:
Import Process
When users import data to the Demographics table two things occur:
- The data is inserted into the studyDataset schema, namely into the corresponding database table "c10d79_demographics".
- The data is inserted into the study schema, namely into an extensive number of bookkeeping and junction tables. These tables are the pre-existing, system tables of the study schema, designed to support the application logic and the built-in reports provided by a LabKey Study. (As such, users should treat these tables as "read-only", and not manipulate them directly using PGAdmn or some other db tool. Users should view, edit, and design datasets only via the official LabKey mechanisms such as the study web UI, domain designer, APIs, archive import, etc.)
System Tables
These system tables are not created by the user, but already exist as part of the study schema, for example:
- Cohort - Contains one row per defined cohort.
- DataSetColumns - Metadata table containing one row of metadata for each column in all study datasets.
- DataSets - Contains one row of metadata per study dataset.
- Participant - Contains one row per subject. ParticipantIDs are limited to 32 characters.
- ParticipantVisit - Contains one row per subject/visit combination in the study. Note that this table is populated dynamically as dataset, assay, and other aligned data is loaded, and is guaranteed to always be complete.
- and so on...
Displaying Visit Labels
The ParticipantVisit table contains one row per participant/visit combination in a study. This can be a useful table for aligning data in queries and reports. Note that when you view "ParticipantVisit/Visit" in the UI, the display for that column will show the visit label if one exists, or the SequenceNum for the field if a visit label has not been defined. This is because the grid view will "follow the lookup" to an appropriate display field. However, if you include the "ParticipantVisit/Visit" column in a query, you will not see the the same 'followed lookup', but will instead see the rowID of that row in the table. To use the actual label in your query, you need to follow the lookup, i.e. use "ParticipantVisit/Visit/Label" instead.
Data Integration
Data imported from different studies is intermingled in these system-level study schema tables. (The data is
not intermingled in studyDataset schema tables.) To get a sense of how data is intermingled in these tables, open PGAdmin, (or the equivalent tool for MS SQL Server) and drill down to:
- Databases > labkey > Schemas > study > Tables.
- Right-click a table such as "study" or "dataset" and select View Data > View All Rows.
Notice that data from many apparently different datasets and study containers is intermingled there. (When the server displays data, it generally separates it by container, providing a form of data partitioning.)
Below is an example system table named "participant". Notice the "containerid" column which indicates which study folder a given participant belongs to.
container | participantid |
---|
"09f8feb4-e006-1033-acf0-1396fc22d1c4" | "PT-101" |
"09f8feb4-e006-1033-acf0-1396fc22d1c4" | "PT-102" |
"09f8feb4-e006-1033-acf0-1396fc22d1c4" | "PT-103" |
"143fdca3-7dd2-1033-816c-97696f0568c0" | "101344" |
"143fdca3-7dd2-1033-816c-97696f0568c0" | "103505" |
"143fdca3-7dd2-1033-816c-97696f0568c0" | "103866" |
Related Topics