LabKey Data Structures

LabKey Server offers a wide variety of ways to store and organize data. Different data types offer specific features, which make them more or less suited for specific scenarios. This topic reviews the data structures available within LabKey Server, and offers guidance for choosing the appropriate structure for storing your data.

The primary deciding factors when selecting a data structure will be the nature of the data being stored and how it will be used. Information about samples should likely be stored as specimens or as a sample set. Information about participants/subjects/animals over time should be stored as datasets in a study folder. Less structured data may import into LabKey Server faster than highly constrained data, but integration may be more difficult. If you do not require extensive data integration or specialized tools, a more lightweight data structure, such as a list, may suit your needs.

The types of LabKey Server data structures appropriate for your work depend on the research scenarios you wish to support. As a few examples:

  • Management of Simple Tabular Data. Lists are a quick, flexible way to manage ordinary tables of data, such as lists of reagents.
  • Integration of Data by Time and Participant for Analysis. Study datasets support the collection, storage, integration, and analysis of information about participants or subjects over time.
  • Analysis of Complex Instrument Data. Assays help you to describe complex data received from instruments, generate standardized forms for data collection, and query, analyze and visualize collected data.
These structures are often used in combination. For example, a study may contain a joined view of a dataset and an assay with a lookup into a list for names of reagents used.

Lists

Lists are the simplest and least constrained data type. They are generic, in the sense that the server does not make any assumptions about the kind of data they contain. Lists are not entirely freeform; they are still tabular data and have primary keys, but they do not require participant IDs or time/visit information. There are many ways to visualize and integrate list data, but some specific applications will require additional constraints.

List data can be imported in bulk from a TSV, or as part of a list or folder archive. Lists also allow row-level inserts, updates, and deletes.

Each list is scoped to a single folder and its child workbooks (if any).

Assays

Assays capture data from individual experiment runs, which usually correspond to an output file from some sort of instrument. Assays have an inherent batch-run-results hierarchy. They are more structured than lists, and support a variety of specialized structures to fit specific applications. Participant IDs and time information are required.

Specific assay types are available, each corresponding to a particular instrument and offering defaults suited to that instrument. The results schema can range from a single, fixed table to many interrelated tables. All assay types allow administrators to configure fields at the run and batch level. Some assay types allow further customization at other levels. For instance, the Luminex assay type allows administrators to customize fields at the analyte level and the results level. There is also a general purpose assay type, which allows administrators to completely customize the set of result fields.

Assay data is usually imported from a single data file at a time, into a corresponding run. Some assay types allow for API import as well, or have customized multi-file import pathways. Assay result data may also be integrated into a study by aligning participant and time information, or by specimen ID.

Assay designs are scoped to the container in which they are defined. To share assay designs among folders or subfolders, define them in the parent folder or project. To make them available site-wide, define them in the Shared project. Run and result data can be stored in any folder in which the design is in scope.

Datasets

Datasets are always part of a study, and they always contain information about participants, subjects, animals, or similar entities. There are different types of datasets with different cardinality: demographic (zero or one row for each subject), “standard”/“clinical” (zero or one row for each subject/timepoint combination), and “extra key”/“assay” (zero or one row for each subject/timepoint/arbitrary field combination).
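The cardinality rules above amount to a uniqueness constraint on a different key for each dataset type. The following is a minimal sketch of that idea in plain Python. It does not use any LabKey API, and the column names (participant_id, visit, extra_key) and type labels are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Key columns per dataset type, following the cardinality rules in the text.
# Column names and type labels are hypothetical, not LabKey identifiers.
KEY_COLUMNS = {
    "demographic": ("participant_id",),
    "standard": ("participant_id", "visit"),
    "extra_key": ("participant_id", "visit", "extra_key"),
}


def duplicate_keys(rows, dataset_type):
    """Return the sorted key tuples that occur more than once in rows."""
    columns = KEY_COLUMNS[dataset_type]
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


rows = [
    {"participant_id": "P1", "visit": 1, "extra_key": "IL-2", "value": 3.1},
    {"participant_id": "P1", "visit": 1, "extra_key": "IL-6", "value": 0.4},
    {"participant_id": "P2", "visit": 1, "extra_key": "IL-2", "value": 2.7},
]
```

These rows would fit an “extra key” dataset, since every subject/timepoint/extra key combination is unique, but calling duplicate_keys(rows, "standard") reports that P1 has two rows for visit 1.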

Datasets have special abilities to automatically join/lookup to other study datasets based on the key types, and to create intelligent visualizations based on these sorts of relationships.

Datasets can be backed by assay data that has been copied to the study. Behind the scenes, this consists of a dataset with rows that contain the primary key (typically the participant ID) of the assay result data, which is looked up dynamically.

Non-assay datasets can be imported in bulk (as part of a TSV paste or a study import), and can also be configured to allow row-level inserts/updates/deletes.

Datasets are typically scoped to a single study in a single folder. In some contexts, however, shared datasets can be defined at the project level and have rows associated with any of the project's subfolders.

Datasets can have special security configuration, where users are granted permission to see (or not see) and edit datasets separately from their permission to the folder itself. As such, permission to the folder is required to see the dataset (i.e., have the Reader role for the folder), but is not necessarily sufficient.
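The "required but not sufficient" rule can be pictured as two checks that must both pass. This is only an illustrative sketch, assuming per-dataset security is turned on for the study. The function name and the dictionaries of roles and grants are hypothetical and do not reflect how LabKey Server stores or evaluates permissions.

```python
def can_read_dataset(folder_roles, dataset_grants, user, dataset):
    """Return True only if the user passes both checks.

    folder_roles:   user -> set of folder role names (hypothetical structure)
    dataset_grants: user -> set of dataset names the user may read
    """
    has_folder_read = "Reader" in folder_roles.get(user, set())
    has_dataset_read = dataset in dataset_grants.get(user, set())
    # Folder access is necessary, but the dataset grant is needed as well.
    return has_folder_read and has_dataset_read


folder_roles = {"alice": {"Reader"}, "bob": {"Reader"}, "carol": set()}
dataset_grants = {"alice": {"Demographics"}, "carol": {"Demographics"}}
```

Here alice can read Demographics, bob cannot (folder access without a dataset grant), and carol cannot (a dataset grant without folder access).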

A special type of dataset, the query snapshot, can be used to extract data from some other sources available in the server, and create a dataset from it. In some cases, the snapshot is automatically refreshed after edits have been made to the source of the data. Snapshots are persisted in a physical table in the database (they are not dynamically generated on demand), and as such they can help alleviate performance issues in some cases.

Custom Queries

A custom query is effectively a non-materialized view in a standard database. It consists of LabKey SQL, which is exposed as a separate, read-only query/table. Every time the data in a custom query is used, it will be re-queried from the database.
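The difference between a non-materialized query and a persisted snapshot (such as the query snapshot described under Datasets) can be demonstrated with any SQL database. The sketch below uses SQLite through Python's standard sqlite3 module rather than LabKey SQL. The table and view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (participant TEXT, value REAL)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?)", [("P1", 1.0), ("P2", 2.0)]
)

# Non-materialized view: only the SQL is stored, and it runs on every read.
conn.execute(
    "CREATE VIEW high_values AS SELECT * FROM results WHERE value > 1.5"
)

# Snapshot: the rows are copied into a physical table at creation time.
conn.execute("CREATE TABLE high_values_snapshot AS SELECT * FROM high_values")

# A later edit to the source table.
conn.execute("INSERT INTO results VALUES ('P3', 3.0)")

view_count = conn.execute("SELECT COUNT(*) FROM high_values").fetchone()[0]
snapshot_count = conn.execute(
    "SELECT COUNT(*) FROM high_values_snapshot"
).fetchone()[0]
```

After the extra insert, the view sees the new row because it is re-queried, while the snapshot table still holds only the rows that existed when it was created, until it is refreshed.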

In order to run the query, the current user must have access to the underlying tables it is querying against.

Custom queries can be created through the web interface in the schema browser, or supplied as part of a module.

Specimens

Specimens are always part of a study. They consist of multiple tables, including vials, specimens, primary type, etc. In addition to the required fields, administrators can customize the optional fields or add new ones for the specimens themselves.

Specimens are almost always loaded in bulk as part of a study or specimen import. It is possible to enable editing of specimens directly through the web UI as well, but this is not common.

Specimens support additional workflows around the creation, review, and approval of specimen requests to coordinate cross-site collaboration over a shared specimen repository.

The configuration for specimens is scoped to a single folder. Only one set of specimen configuration is supported per folder.

Behind the scenes, the server creates an entry in the experiment module’s material table (exp.Materials), which allows specimens to be the inputs or outputs of assay runs.

The specimen system is designed to work with millions of vial records.

Sample Sets

Sample sets allow administrators to create multiple sets of samples in the same folder, each with its own set of customizable fields.

Sample sets are created by pasting in a TSV of data and identifying one, two, or three fields that comprise the primary key. Subsequent updates can be made via TSV pasting (with options for how to handle samples that already exist in the set), or via row-level inserts/updates/deletes.

Sample sets support the notion of a parent sample field. When present, this data will be used to create an experiment run that links the parent and child samples to establish a derivation/lineage history.

One sample set per folder can be marked as the “active” set. Its set of columns will be shown in Customize Grid when doing a lookup to a sample table. Downstream assay results can be linked to the originating sample set via a “Name” field. For details, see Sample Sets.

Sample sets are resolved based on the name. The order of searching for the matching sample set is: the current folder, the current project, and then the Shared project. See Shared Project.
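The search order above can be sketched as a simple first-match lookup. This is an illustration only, with no LabKey API involved. The function name, the scope labels, and the use of dictionaries keyed by sample set name are all assumptions made for the example.

```python
def resolve_sample_set(name, current_folder, current_project, shared_project):
    """Return (scope, sample_set) for the first scope that defines name.

    Each scope argument is a dict mapping sample set names to definitions.
    Returns None when no scope contains the name.
    """
    search_order = [
        ("folder", current_folder),
        ("project", current_project),
        ("shared", shared_project),
    ]
    for scope, sample_sets in search_order:
        if name in sample_sets:
            return scope, sample_sets[name]
    return None


folder = {"Blood": "folder-level Blood"}
project = {"Blood": "project-level Blood", "Tissue": "project-level Tissue"}
shared = {"Plasma": "shared Plasma"}
```

With these hypothetical scopes, "Blood" resolves to the folder-level definition even though the project defines one too, "Tissue" falls through to the project, and "Plasma" is found only in the Shared project.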

DataClass

DataClasses can be used to capture complex lineage and derivation information, for example, the derivations used in bioengineering systems such as the following:

  • Reagents
  • Gene Sequences
  • Proteins
  • Protein Expression Systems
  • Vectors (used to deliver Gene Sequences into a cell)
  • Constructs (= Vectors + Gene Sequences)
  • Cell Lines

Similarities with Sample Sets

A DataClass is similar to a Sample Set or a List, in that it has a custom domain. DataClasses are built on top of the exp.Data table, much like Sample Sets are built on the exp.Materials table. Using the analogy syntax:

SampleSet : exp.Materials :: DataClass : exp.Data

Rows from the various DataClass tables are automatically added to the exp.Data table, but only the Name and Description columns are represented in exp.Data. The various custom columns in the DataClass tables are not added to exp.Data. A similar behavior occurs with the various Sample Set tables and the exp.Materials table.
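One way to picture this behavior is as a projection that keeps only the shared columns. The sketch below is plain Python, not LabKey code. The function name, the example row, and its "Resistance" custom column are hypothetical.

```python
# Columns that, per the text, are represented in exp.Data.
SHARED_COLUMNS = ("Name", "Description")


def to_exp_data_row(dataclass_row):
    """Project a DataClass row onto the columns carried into exp.Data."""
    return {column: dataclass_row.get(column) for column in SHARED_COLUMNS}


# A hypothetical row from a "Vectors" DataClass with one custom column.
vector_row = {
    "Name": "pUC19",
    "Description": "Cloning vector",
    "Resistance": "Ampicillin",
}
exp_data_row = to_exp_data_row(vector_row)
```

The custom "Resistance" value stays in the DataClass table and does not appear in the projected row.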

Also like Sample Sets, every row in a DataClass table has a unique name, scoped across the current folder.

For detailed information, see DataClasses.

Domain

A domain is a collection of fields. Lists, Datasets, Sample Sets, DataClasses, and the Assay Batch, Run, and Result tables are backed by a LabKey internal data type known as a Domain. A Domain has:

  • a name
  • a kind (e.g. "List" or "SampleSet")
  • an ordered set of fields along with their properties.
Each Domain type provides specialized handling for the domains it defines. The number of domains defined by a data type varies; for example, Assays define multiple domains (batch, run, etc.), while Lists and Datasets define only one domain each.
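The three parts of a Domain listed above map naturally onto a small record type. The following is a conceptual sketch only. It is not the shape used by the LABKEY.Domain APIs, and the class names, the property keys, and the "Reagents" example are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DomainField:
    """One field in a domain, with its properties (hypothetical model)."""
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class Domain:
    """A name, a kind, and an ordered list of fields."""
    name: str
    kind: str
    fields: list = field(default_factory=list)

    def field_names(self):
        # A Python list keeps insertion order, so field order is preserved.
        return [f.name for f in self.fields]


reagents = Domain(
    name="Reagents",
    kind="List",
    fields=[
        DomainField("Key", {"type": "int"}),
        DomainField("Name", {"type": "string"}),
    ],
)
```

In this model, an assay design would hold several Domain instances (batch, run, results), while a list would hold exactly one.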

The fields and properties of a Domain can be edited interactively using the domain editor or programmatically using the JavaScript LABKEY.Domain APIs.

Also see Modules: Domain Templates.

External Schemas

External schemas allow an administrator to expose the data in a “physical” database schema through the web interface, and programmatically via APIs. They assume that some external process has created the schemas and tables, and that the server has been configured to connect to the database, via a database connection config in the labkey.xml Tomcat deployment descriptor or its equivalent.

Administrators have the option of exposing the data as read-only, or as insert/update/delete. The server will auto-populate standard fields like Modified, ModifiedBy, Created, CreatedBy, and Container for all rows that it inserts or updates. The standard bulk import options (TSV, etc.) are supported.

External schemas are scoped to a single folder. If an exposed table has a “Container” column, it will be filtered to only show rows whose values match the EntityId of the folder.
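The Container filtering rule can be sketched as follows. This is an illustration of the described behavior rather than server code. The function name, the row dictionaries, and the EntityId strings are made up for the example.

```python
def visible_rows(rows, folder_entity_id):
    """Return the rows an external schema would show in a given folder.

    If the table has a Container column, keep only rows whose Container
    matches the folder's EntityId. Otherwise return every row.
    """
    if not rows or "Container" not in rows[0]:
        return list(rows)
    return [row for row in rows if row["Container"] == folder_entity_id]


# Hypothetical EntityId values for two folders.
rows = [
    {"Id": 1, "Container": "folder-a-entity-id"},
    {"Id": 2, "Container": "folder-b-entity-id"},
    {"Id": 3, "Container": "folder-a-entity-id"},
]
unscoped_rows = [{"Id": 10}, {"Id": 11}]
```

Viewed from the first folder, only rows 1 and 3 are visible, while a table without a Container column shows all of its rows in every folder where the schema is exposed.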

The server can connect to a variety of external databases, including Oracle, MySQL, SAS, PostgreSQL, and SQL Server. The schemas can also be housed in the standard LabKey Server database.

The server does not support cross-database joins. It can do lookups (based on single-column foreign keys learned via JDBC metadata, or on XML metadata configuration), but only within a single database, regardless of whether that database is the standard LabKey Server database.

Linked Schemas

Linked schemas allow you to expose data in a target folder that is backed by some other data type in a different source folder. These linked schemas are always read-only.

This provides a mechanism for showing different subsets of the source data in a different folder, to users who might not have permission to see the data in the source folder.

The linked schema configuration, set up by an administrator, can include filters such that only a portion of the data in the source schema/table is exposed in the target.
