LabKey Data Structures

LabKey Server offers a wide variety of ways to store and organize data. Different data structure types offer specific features, which make them more or less suited for specific scenarios. You can think of these data structures as "table types", each designed to capture a different kind of research data.

This topic reviews the data structures available within LabKey Server, and offers guidance for choosing the appropriate structure for storing your data.

Where Should My Data Go?
Universal Table Features
Lists
Assays
Datasets
Custom Queries
Sample Types
DataClasses
Domains

Domain (Data Structure) Names
File Import Column Names (aka Parent/Source aliases)
Field Names with Special Characters

Vocabulary Domain
External Schemas
Linked Schemas

Where Should My Data Go?

The primary deciding factors when selecting a data structure will be the nature of the data being stored and how it will be used. Information about lab samples should likely be stored as a sample type. Information about participants/subjects/animals over time should be stored as datasets in a study folder. Less structured data may import into LabKey Server faster than highly constrained data, but integration may be more difficult. If you do not require extensive data integration or specialized tools, a more lightweight data structure, such as a list, may suit your needs.

The types of LabKey Server data structures appropriate for your work depend on the research scenarios you wish to support. As a few examples:

Management of Simple Tabular Data. Lists are a quick, flexible way to manage ordinary tables of data, such as lists of reagents.
Integration of Data by Time and Participant for Analysis. Study datasets support the collection, storage, integration, and analysis of information about participants or subjects over time.
Analysis of Complex Instrument Data. Assays help you to describe complex data received from instruments, generate standardized forms for data collection, and query, analyze and visualize collected data.

These structures are often used in combination. For example, a study may contain a joined view of a dataset and an assay with a lookup into a list for names of reagents used.

Universal Table Features

Most LabKey data structures support the following features:

Interactive, Online Grids
Data Validation
Visualizations
SQL Queries
Lookup Fields

Lists

Lists are the simplest and least constrained data type. They are generic, in the sense that the server does not make any assumptions about the kind of data they contain. Lists are not entirely freeform; they are still tabular data and have primary keys, but they do not require participant IDs or time/visit information. There are many ways to visualize and integrate list data, but some specific applications will require additional constraints.

List data can be imported in bulk as part of a TSV, or as part of a folder, study, or list archive. Lists also allow row-level insert/update/delete.

Each list is scoped to a single folder and its child workbooks (if any).

Assays

Assays capture data from individual experiment runs, which usually correspond to an output file from some sort of instrument. Assays have an inherent batch-run-results hierarchy. They are more structured than lists, and support a variety of specialized structures to fit specific applications. Participant IDs and time information are required.
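The batch-run-results hierarchy can be pictured as nested records. The following is only an illustrative sketch in Python: the class names, field names, and sample values are hypothetical and do not reflect how LabKey Server stores assay data internally.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Run:
    """One experiment run, usually one imported instrument file."""
    name: str
    properties: Dict[str, Any] = field(default_factory=dict)  # run-level fields
    results: List[Dict[str, Any]] = field(default_factory=list)  # result rows


@dataclass
class Batch:
    """A batch groups one or more runs (hypothetical model)."""
    name: str
    properties: Dict[str, Any] = field(default_factory=dict)  # batch-level fields
    runs: List[Run] = field(default_factory=list)


def count_results(batch: Batch) -> int:
    """Total number of result rows across every run in the batch."""
    return sum(len(run.results) for run in batch.runs)


# Hypothetical example data: one batch, one run, two result rows.
batch = Batch(name="2024-05 upload", properties={"Operator": "jdoe"})
batch.runs.append(
    Run(
        name="plate1.tsv",
        properties={"Instrument": "Reader A"},
        results=[
            {"ParticipantId": "P1", "Date": "2024-05-01", "Value": 1.2},
            {"ParticipantId": "P2", "Date": "2024-05-01", "Value": 0.9},
        ],
    )
)
print(count_results(batch))
```

Each result row carries participant and time information, while properties shared by the whole file or upload live one level up, on the run or batch.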

Specific assay types are available, which correspond to particular instruments and offer defaults specific to use of the given assay instrument. Results schema can range from a single, fixed table to many interrelated tables. All assay types allow administrators to configure fields at the run and batch level. Some assay types allow further customization at other levels. For instance, the Luminex assay type allows admins to customize fields at the analyte level and the results level. There is also a general purpose assay type, which allows administrators to completely customize the set of result fields.

Usually assay data is imported from a single data file at a time, into a corresponding run. Some assay types allow for API import as well, or have customized multi-file import pathways. Assay result data may also be integrated into a study by aligning participant and time information, or by sample ID.

Assay designs are scoped to the container in which they are defined. To share assay designs among folders or subfolders, define them in the parent folder or project, or to make them available site-wide, define them in the Shared project. Run and result data can be stored in any folder in which the design is in scope.

Datasets

Clinical Datasets are designed to capture the variable characteristics of an organism over time, like blood pressure, mood, weight, and cholesterol levels. Anything you measure at multiple points in time will fit well in a Clinical Dataset.

Datasets are always part of a study. They have two required fields:

ParticipantId (this name may vary) - Holds the unique identifier for the study subject.
Date or Visit (the name may vary) - Either a calendar date or a number.

There are different types of datasets with different cardinality, also known as data row uniqueness:

Demographic: Zero or one row for each subject. For example, each participant has only one enrollment date.
"Standard"/"Clinical": Can have multiple rows per subject, but zero or one row for each subject/timepoint combination. For example, each participant has at most one weight measurement at each visit.
"Extra key"/"Assay": Can have multiple rows for each subject/timepoint combination, but has an additional field providing uniqueness of the subject/timepoint/arbitrary field combination. For example, many tests might be run on each blood sample collected for each participant at each visit.
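The three cardinality rules above amount to different uniqueness keys. As a rough sketch (not LabKey's implementation), the check could look like this in Python. The type labels, the `Visit` field name, and the `Test` extra key are assumptions made for illustration; in a real study the participant and timepoint field names may differ.

```python
from collections import Counter


def key_fields(dataset_type, extra_key_field=None):
    """Return the fields that must be unique together for a dataset type."""
    if dataset_type == "demographic":
        return ("ParticipantId",)
    if dataset_type == "standard":
        return ("ParticipantId", "Visit")
    if dataset_type == "extra_key":
        if extra_key_field is None:
            raise ValueError("extra_key datasets need an additional key field")
        return ("ParticipantId", "Visit", extra_key_field)
    raise ValueError(f"unknown dataset type: {dataset_type}")


def duplicate_keys(rows, dataset_type, extra_key_field=None):
    """List key tuples that occur more than once, i.e. uniqueness violations."""
    fields = key_fields(dataset_type, extra_key_field)
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical rows: several tests per participant per visit.
rows = [
    {"ParticipantId": "P1", "Visit": 1, "Test": "HDL", "Value": 50},
    {"ParticipantId": "P1", "Visit": 1, "Test": "LDL", "Value": 110},
    {"ParticipantId": "P1", "Visit": 2, "Test": "HDL", "Value": 52},
]

print(duplicate_keys(rows, "demographic"))        # [('P1',)]
print(duplicate_keys(rows, "standard"))           # [('P1', 1)]
print(duplicate_keys(rows, "extra_key", "Test"))  # []
```

The same rows violate the demographic and standard rules but fit an extra-key dataset, which is why data with several measurements per timepoint needs the additional key field.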

Datasets have special abilities to automatically join/lookup to other study datasets based on the key fields, and to easily create intelligent visualizations based on these sorts of relationships.

A dataset can be backed by assay data that has been copied to the study. Behind the scenes, this consists of a dataset with rows that contain the primary key (typically the participant ID) of the assay result data, which is looked up dynamically.

Non-assay datasets can be imported in bulk (as part of a TSV paste or a study import), and can also be configured to allow row-level inserts/updates/deletes.

Datasets are typically scoped to a single study in a single folder. In some contexts, however, shared datasets can be defined at the project level and have rows associated with any of the project's subfolders.

Datasets have their own study security configuration, where groups are granted access to datasets separately from their permission to the folder itself. Permission to the folder (i.e., the Reader role for the folder) is a necessary prerequisite for dataset access, but is not necessarily sufficient.
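The "necessary but not sufficient" relationship can be expressed as a two-part check. This is a simplified sketch that assumes a study configured with per-dataset permissions; the function and group names are hypothetical, and the actual outcome depends on how study security is configured.

```python
def can_read_dataset(user_groups, folder_reader_groups, dataset_reader_groups):
    """Both checks must pass: folder access first, then dataset access."""
    has_folder_access = bool(user_groups & folder_reader_groups)
    has_dataset_access = bool(user_groups & dataset_reader_groups)
    return has_folder_access and has_dataset_access


# Hypothetical groups for illustration.
folder_readers = {"Lab Staff", "Clinicians"}
dataset_readers = {"Clinicians"}

print(can_read_dataset({"Clinicians"}, folder_readers, dataset_readers))  # True
print(can_read_dataset({"Lab Staff"}, folder_readers, dataset_readers))   # False
```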

A special type of dataset, the query snapshot, can be used to extract data from some other sources available in the server, and create a dataset from it. In some cases, the snapshot is automatically refreshed after edits have been made to the source of the data. Snapshots are persisted in a physical table in the database (they are not dynamically generated on demand), and as such they can help alleviate performance issues in some cases.

Custom Queries

A custom query is effectively a non-materialized view in a standard database. It consists of LabKey SQL, which is exposed as a separate, read-only query/table. Every time the data in a custom query is used, it will be re-queried from the database.
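The difference between a non-materialized view and data persisted in a physical table (as with the query snapshots described under Datasets) can be demonstrated with any SQL database. The sketch below uses Python's built-in sqlite3 module as a stand-in; it is an analogy only, it does not use LabKey SQL, and the table and column names are made up. Unlike a LabKey snapshot, the copy here is never refreshed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weights (ParticipantId TEXT, Visit INTEGER, Kg REAL)")
conn.execute("INSERT INTO weights VALUES ('P1', 1, 70.0)")

# Non-materialized view: re-queried from the source every time it is used.
conn.execute("CREATE VIEW heavy_view AS SELECT * FROM weights WHERE Kg >= 70")

# Persisted copy: the rows are written to a physical table once.
conn.execute("CREATE TABLE heavy_copy AS SELECT * FROM weights WHERE Kg >= 70")

# Change the source data after both objects exist.
conn.execute("INSERT INTO weights VALUES ('P2', 1, 82.5)")

view_count = conn.execute("SELECT COUNT(*) FROM heavy_view").fetchone()[0]
copy_count = conn.execute("SELECT COUNT(*) FROM heavy_copy").fetchone()[0]
print(view_count, copy_count)  # 2 1
```

The view reflects the new row immediately because it runs against the source each time, while the persisted copy keeps the data as of its creation and is cheaper to read.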

In order to run the query, the current user must have access to the underlying tables it is querying against.

Custom queries can be created through the web interface in the schema browser, or supplied as part of a module.

Sample Types

Sample types allow administrators to create multiple sets of samples in the same folder, each with a different set of customizable fields.

Sample types are created by pasting in a TSV of data and identifying one, two, or three fields that comprise the primary key. Subsequent updates can be made via TSV pasting (with options for how to handle samples that already exist in the set), or via row-level inserts/updates/deletes.

Sample types support the notion of one or more parent sample fields. When present, this data will be used to create an experiment run that links the parent and child samples to establish a derivation/lineage history. Samples can also have "parents" of other dataclasses, such as a "Laboratory" data class indicating where the sample was collected.

One sample type per folder can be marked as the “active” set. Its set of columns will be shown in Customize Grid when doing a lookup to a sample table. Downstream assay results can be linked to the originating sample type via a "Name" field -- for details see Samples.

Sample types are resolved based on the name. The order of searching for the matching sample type is: the current folder, the current project, and then the Shared project. See Shared Project.
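The search order can be sketched as a simple first-match lookup. In this Python illustration each scope is modeled as a dictionary of sample type names; the function name, scope labels, and example data are hypothetical.

```python
def resolve_sample_type(name, current_folder, current_project, shared_project):
    """Return (scope, definition) for the first scope that defines the name."""
    search_order = (
        ("current folder", current_folder),
        ("current project", current_project),
        ("Shared project", shared_project),
    )
    for scope, sample_types in search_order:
        if name in sample_types:
            return scope, sample_types[name]
    return None  # no sample type with this name is in scope


# Hypothetical definitions: "Blood" exists at two levels.
folder = {"Blood": {"fields": ["Name", "Volume"]}}
project = {"Blood": {"fields": ["Name"]}, "Tissue": {"fields": ["Name"]}}
shared = {"Plasma": {"fields": ["Name"]}}

print(resolve_sample_type("Blood", folder, project, shared)[0])   # current folder
print(resolve_sample_type("Tissue", folder, project, shared)[0])  # current project
print(resolve_sample_type("Plasma", folder, project, shared)[0])  # Shared project
```

A definition in the current folder shadows one with the same name higher up, so the closest definition wins.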

DataClasses

DataClasses can be used to capture complex lineage and derivation information, for example, the derivations used in bio-engineering systems. Examples include:

Reagents
Gene Sequences
Proteins
Protein Expression Systems
Vectors (used to deliver Gene Sequences into a cell)
Constructs (= Vectors + Gene Sequences)
Cell Lines

You can also use dataclasses to track the physical and biological Sources of samples.

Similarities with Sample Types

A DataClass is similar to a Sample Type or a List, in that it has a custom domain. DataClasses are built on top of the exp.Data table, much like Sample Types are built on the exp.Materials table. Using the analogy syntax:

SampleType : exp.Materials :: DataClass : exp.Data

Rows from the various DataClass tables are automatically added to the exp.Data table, but only the Name and Description columns are represented in exp.Data. The various custom columns in the DataClass tables are not added to exp.Data. A similar behavior occurs with the various Sample Type tables and the exp.Materials table.

Also like Sample Types, every row in a DataClass table has a unique name, scoped across the current folder. Unique names can be provided (via a Name or other ID column) or generated using a naming pattern.

For more information, see Data Classes.

Domains

A domain is a collection of fields. Lists, Datasets, Sample Types, DataClasses, and the Assay Batch, Run, and Result tables are backed by a LabKey internal data type known as a Domain. A Domain has:

a name
a kind (e.g. "List" or "SampleType")
an ordered set of fields along with their properties.
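As a mental model, the three parts of a Domain map onto a small record type. The Python sketch below is illustrative only: the class and attribute names are hypothetical, and this is not the object shape used by the LABKEY.Domain APIs.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DomainField:
    name: str
    properties: Dict[str, Any] = field(default_factory=dict)  # e.g. label, type


@dataclass
class Domain:
    name: str
    kind: str  # e.g. "List" or "SampleType"
    fields: List[DomainField] = field(default_factory=list)  # order is preserved

    def field_names(self) -> List[str]:
        """Field names in their defined order."""
        return [f.name for f in self.fields]


# Hypothetical list domain with two fields.
reagents = Domain(
    name="Reagents",
    kind="List",
    fields=[
        DomainField("Name", {"type": "string"}),
        DomainField("LotNumber", {"type": "string", "label": "Lot #"}),
    ],
)
print(reagents.field_names())  # ['Name', 'LotNumber']
```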

Each Domain type provides specialized handling for the domains it defines. The number of domains defined by a data type varies; for example, Assays define multiple domains (batch, run, etc.), while Lists and Datasets define only one domain each.

The fields and properties of a Domain can be edited interactively using the field editor or programmatically using the JavaScript LABKEY.Domain APIs.

Also see Modules: Domain Templates.

Domain/Data Structure Names

Data structures (Domains) like Sample Types, Source Types, Assay Designs, etc. must have unique names and avoid specific special characters, particularly if they are to be used in naming patterns or API calls. Names must follow these rules:

Must not be blank
Must start with a letter or a number character.
Must contain only valid unicode characters. (no control characters)
May not contain any of these characters:
```
<>[]{};,`"~!@#$%^*=|?\
```
May not contain 'tab', 'new line', or 'return' characters.
May not contain a space followed by a dash followed by a non-space character.

i.e. these are allowed: "a - b" or "a-b" or "a--b"
these are not allowed: "a -b", "a --b"
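The rules above can be approximated with a small validator. This Python sketch is one reading of the rules, not the server's validation code. It assumes that "character" in the space-dash rule means a non-space character (which is what the allowed example "a - b" implies) and treats all Unicode control characters, including tab, new line, and return, as forbidden. The function name is hypothetical.

```python
import re
import unicodedata

# Characters listed as not allowed in domain names.
FORBIDDEN_CHARS = set('<>[]{};,`"~!@#$%^*=|?\\')

# A space, then a dash, then any non-space character.
SPACE_DASH_CHAR = re.compile(r" -\S")


def domain_name_problems(name):
    """Return a list of rule violations; an empty list means the name passes."""
    if not name.strip():
        return ["must not be blank"]
    problems = []
    if not name[0].isalnum():
        problems.append("must start with a letter or a number")
    if any(unicodedata.category(ch) == "Cc" for ch in name):
        problems.append("must not contain control characters")
    bad = sorted(set(name) & FORBIDDEN_CHARS)
    if bad:
        problems.append("contains forbidden characters: " + "".join(bad))
    if SPACE_DASH_CHAR.search(name):
        problems.append("contains a space, then a dash, then a character")
    return problems


print(domain_name_problems("a - b"))     # []
print(domain_name_problems("a -b"))      # one problem: space-dash-character
print(domain_name_problems("Blood[1]"))  # one problem: forbidden characters
```

Returning a list of problems rather than a single boolean makes it easier to show a user every issue with a proposed name at once.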

For domains that support naming expressions (Sample Types, Sources), these special substitution strings are not allowed to be used as names:

AliquotedFrom
~DataInputs
DataInputs
Inputs
~MaterialInputs
MaterialInputs
batchRandomId
containerPath
contextPath
sampleCount
rootSampleCount
dailySampleCount
dataRegionName
genId
monthlySampleCount
now
queryName
randomId
schemaName
schemaPath
selectionKey
weeklySampleCount
withCounter
yearlySampleCount
folderPrefix

Names are not allowed to contain the following substrings. These are used as substitution operators internally:

:passThrough
:htmlEncode
:jsString
:urlEncode
:encodeURIComponent
:encodeURI
:first
:rest
:last
:trim
:date
:dailySampleCount
:weeklySampleCount
:yearlySampleCount
:monthlySampleCount
:defaultValue
:minValue
:number
:prefix
:suffix
:join
:withCounter
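The two lists above can be checked mechanically. The following Python sketch copies both lists from this topic; the function name is hypothetical, and it assumes exact, case-sensitive matching, which this topic does not specify.

```python
# Substitution strings that may not be used as names.
RESERVED_NAMES = frozenset({
    "AliquotedFrom", "~DataInputs", "DataInputs", "Inputs", "~MaterialInputs",
    "MaterialInputs", "batchRandomId", "containerPath", "contextPath",
    "sampleCount", "rootSampleCount", "dailySampleCount", "dataRegionName",
    "genId", "monthlySampleCount", "now", "queryName", "randomId",
    "schemaName", "schemaPath", "selectionKey", "weeklySampleCount",
    "withCounter", "yearlySampleCount", "folderPrefix",
})

# Substitution operators that may not appear anywhere inside a name.
FORBIDDEN_SUBSTRINGS = (
    ":passThrough", ":htmlEncode", ":jsString", ":urlEncode",
    ":encodeURIComponent", ":encodeURI", ":first", ":rest", ":last", ":trim",
    ":date", ":dailySampleCount", ":weeklySampleCount", ":yearlySampleCount",
    ":monthlySampleCount", ":defaultValue", ":minValue", ":number", ":prefix",
    ":suffix", ":join", ":withCounter",
)


def naming_conflicts(name):
    """Return the reasons a name collides with naming-pattern syntax."""
    conflicts = []
    if name in RESERVED_NAMES:  # assumption: exact, case-sensitive match
        conflicts.append(f"'{name}' is a reserved substitution string")
    conflicts.extend(
        f"contains '{sub}'" for sub in FORBIDDEN_SUBSTRINGS if sub in name
    )
    return conflicts


print(naming_conflicts("Plasma"))        # []
print(naming_conflicts("genId"))         # reserved substitution string
print(naming_conflicts("Plasma:first"))  # contains ':first'
```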

File Import Column Names (aka Parent/Source aliases)

Must not contain any of the following characters:
```
/:<>$[]{};,`"~!@#$%^*=|?\
```

Field Names with Special Characters

While field names themselves are allowed to contain special characters, it is best practice to have "database legal" names without them and use the special characters you want to show users in the field label.

You can find a list of all the field names using the "exp.Fields" table in the query browser. View data, use the grid folder filter to set the desired scope, then filter the Special Characters column for "true".

External Schemas

External schemas allow an administrator to expose the data in a "physical" database schema through the web interface, and programmatically via APIs. They assume that some external process has created the schemas and tables, and that the server has been configured to connect to the database.

Administrators have the option of exposing the data as read-only, or as insert/update/delete. The server will auto-populate standard fields like Modified, ModifiedBy, Created, CreatedBy, and Container for all rows that it inserts or updates. The standard bulk import options (TSV, etc.) are supported.

External schemas are scoped to a single folder. If an exposed table has a "Container" column, it will be filtered to only show rows whose values match the EntityId of the folder.
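The container filtering behavior can be sketched as follows. This Python example only illustrates the rule as stated above; the function name, column list, and EntityId values are made up.

```python
def visible_rows(columns, rows, folder_entity_id):
    """Apply the folder filter only when the table has a Container column."""
    if "Container" not in columns:
        return list(rows)  # no Container column: every row is shown
    return [row for row in rows if row.get("Container") == folder_entity_id]


# Hypothetical external table shared by two folders.
columns = ["Id", "Result", "Container"]
rows = [
    {"Id": 1, "Result": "pos", "Container": "folder-a-entity-id"},
    {"Id": 2, "Result": "neg", "Container": "folder-b-entity-id"},
]

print(visible_rows(columns, rows, "folder-a-entity-id"))
```

With these example rows, a user in the first folder sees only row 1, while a table without a Container column would show all of its rows in every folder that exposes it.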

The server can connect to a variety of external databases, including Oracle, MySQL, SAS, PostgreSQL, and SQL Server. The schemas can also be housed in the standard LabKey Server database.

The server does not support cross-database joins. It can do lookups (based on single-column foreign keys learned via JDBC metadata, or on XML metadata configuration), but only within a single database, regardless of whether that database is the standard LabKey Server database.

Learn more in this topic:

External Schemas and Data Sources

Linked Schemas

Linked schemas allow you to expose data in a target folder that is backed by some other data in a different source folder. These linked schemas are always read-only.

This provides a mechanism for showing different subsets of the source data in a different folder, where the user might not have permission (or need) to see everything else available in the source folder.

The linked schema configuration, set up by an administrator, can include filters such that only a portion of the data in the source schema/table is exposed in the target.

Learn more in this topic:

Linked Schemas and Tables
