The LabKey data processing pipeline allows you to process and import data files with tools we supply, or with tools you build on your own. You can set a pipeline override to allow the data processing pipeline to operate on files in a preferred, pre-existing directory instead of the directory where LabKey ordinarily stores files for a project. Note that you can still use the data processing pipeline without setting up a pipeline override if the system's default locations for file storage are sufficient for you.
A pipeline override is a directory on the file system accessible to the web server where the server can read and write files. Usually the pipeline override is a shared directory on a file server, where data files can be deposited (e.g., after MS/MS runs). You can also set the pipeline override to be a directory on your local computer.
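Before pointing the pipeline at a directory, it can help to confirm that the account running the web server can actually read and write there. The following is a minimal sketch, not part of LabKey: the function name `check_override_directory` is hypothetical, and the check is only meaningful when run as the same operating system account that runs the server.

```python
import os
import tempfile
from pathlib import Path


def check_override_directory(path):
    """Return a list of problems that would prevent reading/writing in `path`.

    Hypothetical helper for illustration; an empty list means no problem found.
    """
    directory = Path(path)
    if not directory.is_dir():
        return ["not an existing directory"]

    problems = []
    # Reading a directory listing requires read and execute (search) permission.
    if not os.access(directory, os.R_OK | os.X_OK):
        problems.append("not readable")
    # The most reliable write test is to create (and auto-delete) a real file.
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError:
        problems.append("not writable")
    return problems
```

An empty result only says that the current account has access; it says nothing about which files beneath the directory are appropriate to expose.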
Before you set the pipeline override, you may want to think about how your file server is organized. The pipeline override directory is essentially a window into your file system, so you should make sure that the directories beneath the override directory will contain only files that users of your LabKey system should have permissions to see. On the LabKey side, subfolders inherit pipeline override settings, so once you set the override, LabKey can upload data files from the override directory tree into the folder and any subfolders.
Single Machine Setup
These steps will help you set up the pipeline, including an override directory, for use on a single computer. For information on setup for a distributed environment, see Setup for Distributed Environment below.
- Select (Admin) > Go to Module > Pipeline.
- Click Setup. (Note: you must be a Site Administrator to see the Setup option.)
- You will now see the "Data Processing Pipeline Setup" page.
- Select Set a pipeline override.
- Specify the Primary Directory from which your dataset files will be loaded.
- Click the Searchable box if you want the pipeline override directory included in site searches. By default, the materials in the pipeline override directory are not indexed.
- For MS2 only, you have the option to include a Supplemental Directory from which dataset files can be loaded. No files will be written to the supplemental directory.
- You may also choose to customize Pipeline Files Permissions using the panel to the right.
- Click Save.
Notice that you also have the option to override email notification settings at this level if desired.
Include Supplemental File Location (Optional)
MS2 projects that set a pipeline override can specify a supplemental, read-only directory, which can be used as a repository for your original data files. If a supplemental directory is specified, LabKey Server will treat both directories as sources for input data to the pipeline, but it will create and change files only in the first, primary directory.
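The read-from-both, write-to-primary behavior can be pictured with a short sketch. This is an illustration only, not LabKey's implementation: the function names are hypothetical, and the lookup order (primary first, then supplemental) is an assumption the text above does not state.

```python
from pathlib import Path


def find_input(relative_name, primary, supplemental=None):
    """Locate an input file in the primary or supplemental directory.

    Assumption: the primary directory is searched before the supplemental one.
    Returns the first existing file, or None if neither directory has it.
    """
    for base in (primary, supplemental):
        if base is None:
            continue
        candidate = Path(base) / relative_name
        if candidate.is_file():
            return candidate
    return None


def output_path(relative_name, primary):
    """Outputs always go under the primary directory, never the supplemental one."""
    return Path(primary) / relative_name
```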
Note that UNC paths are not supported for pipeline roots here. Instead, create a network drive mapping configuration via (Admin) > Site > Admin Console > Settings > Configuration > Files. Then specify the path on the mapped drive letter as the supplemental file location.
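If you are unsure whether a path you plan to enter is a UNC path, a quick check like the one below can help. It is a minimal sketch with a hypothetical function name; it only looks at the leading characters of the string and does not contact the network.

```python
def is_unc_path(path):
    """Return True if `path` looks like a UNC path such as \\\\server\\share\\dir.

    Forward slashes are treated like backslashes, as Windows does.
    """
    normalized = path.replace("/", "\\")
    return normalized.startswith("\\\\")
```

For example, `is_unc_path(r"\\fileserver\share\raw")` is true, while a mapped drive path such as `is_unc_path(r"T:\raw")` is false (the server name and drive letter here are made up).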
Set Pipeline Files Permissions (Optional)
By default, pipeline files are not shared. To allow pipeline files to be downloaded or updated via the web server, check the Share files via web site checkbox. Then select appropriate levels of permissions for members of global and project groups.
Configure Network Drive Mapping (Optional)
If you are running LabKey Server on Windows and you are connecting to a remote network share, you may need to configure network drive mapping for LabKey Server so that LabKey Server can create the necessary service account to access the network share. For more information, see Installation: SMTP, Encryption, LDAP, and File Roots.
Additional Options for MS2 Runs
When setting up a single machine for MS2 runs, you can use the Supplemental File Location option to have the pipeline read files from an additional data source directory. Other options include:
Set the FASTA Root for Searching Proteomics Data
The FASTA root is the directory where the FASTA databases that you will use for peptide and protein searches against MS/MS data are located. FASTA databases may be located within the FASTA root directory itself, or in a subdirectory beneath it.
To configure the location of the FASTA databases used for peptide and protein searches against MS/MS data:
- On the MS2 Dashboard, click Setup in the Data Pipeline web part.
- Under MS2 specific settings, click Set FASTA Root.
- By default, the FASTA root directory is set to point to a /databases directory nested in the directory that you specified for the pipeline override. However, you can set the FASTA root to be any directory that's accessible by users of the pipeline.
- Click Save.
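The default location and the "root or any subdirectory" rule can be sketched as follows. This is an illustration under stated assumptions, not LabKey code: the function names are hypothetical, and the set of file extensions treated as FASTA databases is a guess you should adjust to your own files.

```python
from pathlib import Path

# Assumption: which extensions count as FASTA databases is site-specific.
FASTA_EXTENSIONS = {".fasta", ".fa"}


def default_fasta_root(pipeline_override):
    """Default FASTA root: a 'databases' directory inside the pipeline override."""
    return Path(pipeline_override) / "databases"


def find_fasta_files(fasta_root):
    """List FASTA files in the root and all subdirectories, relative to the root."""
    root = Path(fasta_root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in FASTA_EXTENSIONS
    )
```

Running `find_fasta_files(default_fasta_root(override))` against your own override directory is a quick way to see which databases sit under the default FASTA root.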
Selecting the Allow Upload checkbox permits users with admin privileges to upload FASTA files to the FASTA root directory. If this checkbox is selected, the Add FASTA File link appears under MS2 specific settings on the data pipeline setup page. Admin users can click this link to upload a FASTA file from their local computer to the FASTA root on the server.
If you prefer to control what FASTA files are available to users of your LabKey Server site, leave this checkbox unselected. The Add FASTA File link will not appear on the pipeline setup page. In this case, the network administrator can add FASTA files directly to the root directory on the file server.
By default, all subfolders will inherit the pipeline configuration from their parent folder. You can override this if you wish.
When you use the pipeline to browse for files, it will remember where you last loaded data for your current folder and bring you back to that location. You can click on a parent directory to change your location in the file system.
Set X! Tandem, Sequest, or Mascot Defaults for Searching Proteomics Data
You can specify default settings for X! Tandem, Sequest, or Mascot for the data pipeline in the current project or folder. On the pipeline setup page, click the Set defaults link under X! Tandem specific settings, Sequest specific settings, or Mascot specific settings.
The default settings are stored at the pipeline override in a file named default_input.xml. These settings are copied to the search engine's analysis definition file (named tandem.xml, sequest.xml or mascot.xml by default) for each search protocol that you define for data files beneath the pipeline override. The default settings can be overridden for any individual search protocol. See Search and Process MS2 Data for information about configuring search protocols.
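The relationship between defaults and per-protocol settings can be illustrated with a small sketch. It assumes both files use X! Tandem-style BIOML input notes (`<note type="input" label="...">value</note>`); the function names are hypothetical, and this shows only the override idea, not how LabKey Server actually merges the files.

```python
import xml.etree.ElementTree as ET


def read_input_notes(xml_text):
    """Collect {label: value} from <note type="input" label="..."> elements."""
    root = ET.fromstring(xml_text)
    return {
        note.get("label"): (note.text or "").strip()
        for note in root.iter("note")
        if note.get("type") == "input" and note.get("label")
    }


def effective_settings(defaults_xml, protocol_xml):
    """Start from the defaults, then let the protocol's own values win."""
    settings = read_input_notes(defaults_xml)
    settings.update(read_input_notes(protocol_xml))
    return settings
```

With this model, a parameter that appears only in the defaults is carried through unchanged, while a parameter that the protocol also defines takes the protocol's value.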
Setup for Distributed Environment
The pipeline that is installed with a standard LabKey installation runs on a single computer. Since the pipeline's search and analysis operations are resource-intensive, the standard pipeline is most useful for evaluation and small-scale experimental purposes.
For institutions performing high-throughput experiments and analyzing the resulting data, the pipeline is best run in a distributed environment, where the resource load can be shared across a set of dedicated servers. Setting up the LabKey pipeline to leverage distributed processing demands some customization as well as a high level of network and server administrative skill. If you wish to set up the LabKey pipeline for use in a distributed environment, contact LabKey.