Given two unlabeled samples analyzed on a mass spectrometer, it is often desirable to assess whether a given protein is more abundant in one sample than in the other. One strategy is to count the spectra that the search engine identifies in each sample. This technique requires a statistical comparison of multiple, repeated MS2 runs of each sample. LabKey Server makes handling the data from multiple runs straightforward.
To illustrate the technique, we will use mzXML files that were described in this paper as the "Variability Mix":
Jacob D. Jaffe, D. R. Mani, Kyriacos C. Leptos, George M. Church, Michael A. Gillette, and Steven A. Carr, "PEPPeR, a Platform for Experimental Proteomic Pattern Recognition", Molecular and Cellular Proteomics; 5: 1927 - 1941, October 2006
The datasets are derived from two sample protein mixes, alpha and beta, with varied concentrations of a specific list of 12 proteins. The samples were run on a Thermo Fisher Scientific LTQ FT Ultra Hybrid mass spectrometer. The resulting data files were converted to the mzXML format, and the mzXML files were downloaded from Tranche.
The files named VARMIX_A through VARMIX_E were replicates of the Alpha mix. The files named VARMIX_K through VARMIX_O were the Beta mix.
You can see the examples in our online demo project.
The mzXML files provided with the PEPPeR paper included MS2 scan data. The first task is to set up an MS2 search protocol that correctly identifies the 12 proteins spiked into the samples. The published data do not include the FASTA file used as the basis of the search, so it has to be created from the descriptions in the paper. The paper did provide the search parameters used by the authors, but these were given for the SpectrumMill search engine, which is neither freely available nor accessible from LabKey Server. The SpectrumMill parameters are therefore translated into their approximate equivalents for the X!Tandem search engine that is included with LabKey Server.
The PEPPeR paper gives the following information about the protein database against which they conducted their search:
Data from the Scale Mixes and Variability Mixes were searched against a small protein database consisting of only those proteins that composed the mixtures and common contaminants… Data from the mitochondrial preparations were searched against the International Protein Index (IPI) mouse database version 3.01 and the small database mentioned above.
The spiked proteins are identified in the paper by common names such as "Aprotinin". The paper did not give specific protein database identifiers such as IPI numbers or SwissProt names. The following list of 13 SwissProt names is based on Expasy searches using the given common names as search terms. (Note that "alpha-Casein" became two SwissProt entries.)
Common Name | Organism | SprotName | Conc. In A | Conc. In B |
---|---|---|---|---|
Aprotinin | Cow | BPT1_BOVIN | 100 | 5 |
Ribonuclease | Cow | RNAS1_BOVIN | 100 | 100 |
Myoglobin | Horse | MYG_HORSE | 100 | 100 |
beta-Lactoglobulin | Cow | LACB_BOVIN | 50 | 1 |
alpha-Casein S2 | Cow | CASA2_BOVIN | 100 | 10 |
alpha-Casein S1 | Cow | CASA1_BOVIN | 100 | 10 |
Carbonic anhydrase | Cow | CAH2_BOVIN | 100 | 100 |
Ovalbumin | Chicken | OVAL_CHICK | 5 | 10 |
Fibrinogen beta chain | Cow | FIBB_BOVIN | 25 | 25 |
Albumin | Cow | ALBU_BOVIN | 200 | 200 |
Transferrin | Human | TRFE_HUMAN | 10 | 5 |
Plasminogen | Human | PLMN_HUMAN | 2.5 | 25 |
beta-Galactosidase | E. coli | BGAL_ECOLI | 1 | 10 |
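The concentrations in the table imply an expected Alpha-to-Beta ratio for each protein, which is a useful reference when judging spectra count differences later on. The following Python sketch computes those ratios from the table values. The names `CONCENTRATIONS` and `expected_ratios` are illustrative only and are not part of LabKey Server.

```python
# Concentrations copied from the table above as (conc_in_A, conc_in_B),
# keyed by SwissProt name.
CONCENTRATIONS = {
    "BPT1_BOVIN": (100, 5),
    "RNAS1_BOVIN": (100, 100),
    "MYG_HORSE": (100, 100),
    "LACB_BOVIN": (50, 1),
    "CASA2_BOVIN": (100, 10),
    "CASA1_BOVIN": (100, 10),
    "CAH2_BOVIN": (100, 100),
    "OVAL_CHICK": (5, 10),
    "FIBB_BOVIN": (25, 25),
    "ALBU_BOVIN": (200, 200),
    "TRFE_HUMAN": (10, 5),
    "PLMN_HUMAN": (2.5, 25),
    "BGAL_ECOLI": (1, 10),
}


def expected_ratios(concentrations):
    """Return {sprot_name: conc_in_A / conc_in_B} for every protein."""
    return {name: a / b for name, (a, b) in concentrations.items()}


ratios = expected_ratios(CONCENTRATIONS)

# List proteins from most Alpha-enriched to most Beta-enriched.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} A/B = {ratio:g}")
```

Proteins with a ratio near 1 (for example `ALBU_BOVIN`) serve as controls, while those with ratios far from 1 are the ones where a spectra count difference between the two mixes would be expected.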
As in the PEPPeR study, the total search database consisted of
When analyzing a specific set of identified proteins as in this exercise, it is very useful to load the known data about the proteins as a custom protein annotation list. To add custom protein annotations using our example file attached to this page:
Spectra counts rely on the output of the search engine, and therefore the search parameters will likely affect the results. The original paper used SpectrumMill and gave its search parameters. For LabKey Server, the parameters must be translated to X!Tandem. These are the parameters applied:
<bioml>
<!-- Carbamidomethylation (C) -->
<note label="residue, modification mass" type="input">57.02@C</note>
<!-- Carbamylated Lysine (K), Oxidized methionine (M) -->
<note label="residue, potential modification mass" type="input">43.01@K,16.00@M</note>
<note label="scoring, algorithm" type="input">k-score</note>
<note label="spectrum, use conditioning" type="input">no</note>
<note label="pipeline quantitation, metabolic search type" type="input">normal</note>
<note label="pipeline quantitation, algorithm" type="input">xpress</note>
</bioml>
Notes on these choices:
One way to assess how well the X!Tandem search identified the known proteins in the mixtures is to compare the results across all 50 runs, or across the subsets of 25 runs that make up the Alpha mix set and the Beta mix set. To enable easy grouping of the runs into Alpha and Beta mix sets, create two Run Groups (for example AlphaRunGroup and BetaRunGroup) and add the runs to them. Run groups can be created from the Add to run group button on the MS2 Runs (enhanced) grid.
After the run groups have been created, and runs assigned to them, it is easy to compare the protein identifications in samples from just one of the two groups by the following steps:
Most of the spiked proteins will show up in all 50 runs with a probability approaching 1.0. Two of the proteins, beta-Galactosidase and Plasminogen, appear in only half of the Alpha mix runs. This is consistent with the low concentration of these two proteins in the Alpha mix, as shown in the table in an earlier section. Similarly, only beta-Lactoglobulin and Aprotinin fail to show up in all 25 of the runs for the Beta mix. These two proteins are among those with the lowest concentrations in the Beta mix.
Overall, the identifications seem to be strong enough to support a quantitation analysis.
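The same check can be made on exported data by computing, for each run group, the fraction of runs in which each protein was identified. The sketch below is a minimal illustration in Python. The row layout `(run_group, run_id, protein, probability)`, the sample rows, and the `min_prob` threshold of 0.9 are all assumptions made for the example, not values taken from the PEPPeR data or from LabKey Server.

```python
from collections import defaultdict


def detection_fraction(rows, runs_per_group, min_prob=0.9):
    """Return {(run_group, protein): fraction of runs with an identification}.

    rows: iterable of (run_group, run_id, protein, probability) tuples.
    runs_per_group: {run_group: total number of runs in that group}.
    min_prob: identifications below this probability are ignored
              (0.9 is an arbitrary cutoff chosen for illustration).
    """
    runs_seen = defaultdict(set)
    for group, run_id, protein, prob in rows:
        if prob >= min_prob:
            runs_seen[(group, protein)].add(run_id)
    return {
        key: len(run_ids) / runs_per_group[key[0]]
        for key, run_ids in runs_seen.items()
    }


# Hypothetical identifications from two Alpha runs and two Beta runs.
example_rows = [
    ("Alpha", 1, "ALBU_BOVIN", 1.00),
    ("Alpha", 2, "ALBU_BOVIN", 0.99),
    ("Alpha", 1, "BGAL_ECOLI", 0.95),
    ("Alpha", 2, "BGAL_ECOLI", 0.40),  # below the cutoff, not counted
    ("Beta", 3, "ALBU_BOVIN", 1.00),
    ("Beta", 4, "ALBU_BOVIN", 1.00),
]
fractions = detection_fraction(example_rows, {"Alpha": 2, "Beta": 2})

for (group, protein), fraction in sorted(fractions.items()):
    print(f"{group:5s} {protein:12s} {fraction:.2f}")
```

A protein whose fraction is well below 1.0 in one group, as with the low-concentration proteins described above, is a candidate for a real abundance difference rather than a search failure.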
The wide format of the ProteinProphet view is designed for viewing online. It can be downloaded to an Excel or TSV file, but the format is not well suited to further client-side analysis. For example, the multiple columns of data under each run make it difficult to reference the correct columns in Excel formulas. The spectra count views address this problem. These views have a regular column structure, with Run Id as just a single column.
After the options page, LabKey Server displays the resulting data grouped as specified. Selecting Grid Views > Customize Grid gives access to the column picker for choosing which data to aggregate, and what aggregate function to use. You can also specify a filter and ordering; these act after the grouping operation in the same way as SQL HAVING and ORDER BY apply after the GROUP BY.
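The order in which grouping, filtering, and sorting are applied can be illustrated with plain SQL. The following is a minimal sketch using Python's built-in `sqlite3` module rather than LabKey Server's own query engine. The table name `peptides` and its column names are hypothetical, and the rows are loosely based on the example grid shown later in this topic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE peptides ("
    "run_id INTEGER, protein TEXT, total_peptide_count INTEGER)"
)

# One row per (run, peptide); several peptides can map to the same protein.
rows = [
    (276, "ALBU_BOVIN", 16),
    (276, "ALBU_BOVIN", 29),
    (276, "ALBU_BOVIN", 18),
    (276, "OVAL_CHICK", 4),
    (299, "ALBU_BOVIN", 16),
    (299, "BGAL_ECOLI", 1),
    (299, "MYG_HORSE", 40),
]
conn.executemany("INSERT INTO peptides VALUES (?, ?, ?)", rows)

# GROUP BY builds one row per (run, protein); HAVING then filters those
# aggregated rows, and ORDER BY sorts whatever survives the filter.
query = """
    SELECT run_id, protein, SUM(total_peptide_count) AS spectra
    FROM peptides
    GROUP BY run_id, protein
    HAVING SUM(total_peptide_count) >= 10
    ORDER BY spectra DESC
"""
result = conn.execute(query).fetchall()

for run_id, protein, spectra in result:
    print(run_id, protein, spectra)
```

Because the filter is evaluated on the aggregated value, a protein with many low-count peptides can pass a threshold that none of its individual peptide rows would meet.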
Because the spectra count output is a single rectangular result set, there will be repeated information with some grouping options. In the peptide, protein grid, for example, the peptide data values will be repeated for every protein that the peptide could be matched to. The table below illustrates this type of grouping:
(row) | Run Id | Alpha Run Grp | Peptide | Charge States Obsv | Tot Peptide Cnt | Max PepProph | Protein | Prot Best Gene Name |
---|---|---|---|---|---|---|---|---|
1 | 276 | false | K.AEFVEVTK.L | 2 | 16 | 0.9925 | ALBU_BOVIN | ALB |
2 | 276 | false | K.ATEEQLK.T | 2 | 29 | 0.9118 | ALBU_BOVIN | ALB |
3 | 276 | false | K.C^CTESLVNR.R | 1 | 18 | 0.9986 | ALBU_BOVIN | ALB |
4 | 276 | false | R.GGLEPINFQTAADQAR.E | 1 | 4 | 0.9995 | OVAL_CHICK | SERPINB14 |
5 | 276 | false | R.LLLPGELAK.H | 1 | 7 | 0.9761 | H2B1A_MOUSE | Hist1h2ba |
6 | 276 | false | R.LLLPGELAK.H | 1 | 7 | 0.9761 | H2B1B_MOUSE | Hist1h2bb |
7 | 276 | false | R.LLLPGELAK.H | 1 | 7 | 0.9761 | H2B1C_MOUSE | Hist1h2bg |
8 | 299 | true | K.AEFVEVTK.L | 2 | 16 | 0.9925 | ALBU_BOVIN | ALB |
9 | 299 | true | K.ECCHGDLLECADDR.A | 1 | 12 | 0.9923 | ALBU_MOUSE | Alb |
10 | 299 | true | R.LPSEFDLSAFLR.A | 1 | 1 | 0.9974 | BGAL_ECOLI | lacZ |
11 | 299 | true | K.YLEFISDAIIHVLHSK.H | 2 | 40 | 0.9999 | MYG_HORSE | MB |
In this example, notice the following:
An Excel pivot table is a useful tool for consuming the data sets returned by the spectra count comparison in LabKey Server. For example, it can quickly roll up the Protein grouping data set and report ProteinProphet’s “Total Peptides” count, which is a count of spectra with some correction for the potential pitfalls in mapping peptides to proteins.
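For readers who prefer a script to a spreadsheet, the same roll-up can be sketched in a few lines of Python. This is an illustration of the pivot idea only: the row layout `(protein, run_group, count)` and the sample values are hypothetical, and an exported file would first need to be read into this shape.

```python
from collections import defaultdict


def pivot_counts(rows):
    """Sum counts into a {protein: {run_group: total}} table.

    rows: iterable of (protein, run_group, count) tuples, one per run.
    """
    table = defaultdict(lambda: defaultdict(int))
    for protein, run_group, count in rows:
        table[protein][run_group] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {protein: dict(groups) for protein, groups in table.items()}


# Hypothetical per-run counts for two proteins.
example_rows = [
    ("ALBU_BOVIN", "Alpha", 60),
    ("ALBU_BOVIN", "Alpha", 58),
    ("ALBU_BOVIN", "Beta", 61),
    ("BGAL_ECOLI", "Alpha", 1),
    ("BGAL_ECOLI", "Beta", 9),
    ("BGAL_ECOLI", "Beta", 11),
]
pivot = pivot_counts(example_rows)

# Print one line per protein with a column for each run group.
print(f"{'Protein':12s} {'Alpha':>6s} {'Beta':>6s}")
for protein in sorted(pivot):
    groups = pivot[protein]
    print(f"{protein:12s} {groups.get('Alpha', 0):6d} {groups.get('Beta', 0):6d}")
```

Proteins become rows and run groups become columns, which is the layout a pivot table produces when the protein is the row field and the run group is the column field.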
The spectra count data set can also be passed into an R script for statistical analysis, reporting, and charting. R script files that illustrate this technique can be downloaded here. Note that column names are hard-coded and may need adjustment to match your data.
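To give a sense of the kind of comparison such a script might perform, the sketch below computes two simple statistics for one protein from per-run spectra counts: a log2 ratio of the group means and Welch's t statistic. It is written in Python with the standard library only and is not the method used by the downloadable R scripts. The counts and the pseudocount of 1.0 are hypothetical.

```python
import math
from statistics import mean, variance


def log2_ratio(a, b, pseudocount=1.0):
    """log2 of the ratio of mean counts; the pseudocount avoids log(0)."""
    return math.log2((mean(a) + pseudocount) / (mean(b) + pseudocount))


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances.

    Raises ZeroDivisionError if both samples have zero variance.
    """
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical spectra counts for one protein in five replicate runs each.
alpha_counts = [20, 22, 18, 21, 19]
beta_counts = [2, 1, 3, 2, 2]

ratio = log2_ratio(alpha_counts, beta_counts)
t_stat = welch_t(alpha_counts, beta_counts)
print(f"log2 ratio = {ratio:.3f}, Welch t = {t_stat:.2f}")
```

Spectra counts are small integers and are often far from normally distributed, so a t statistic is only a rough screen. A count-based model or a permutation test may be more appropriate for a real analysis.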