The mzXML files provided with the PEPPeR paper included both MS1 and MS2 scan data. The first challenge was to establish an MS2 search protocol that correctly identified the 12 proteins spiked into the samples. The published data did not include the FASTA file used as the basis of the search, so it had to be created from the descriptions in the paper. The paper did provide the authors' search parameters, but these were given for the SpectrumMill search engine, which is neither freely available nor accessible from CPAS. The SpectrumMill parameters were therefore translated into their approximate equivalents for the X!Tandem search engine, which is included with CPAS.

Creating the right FASTA file

The PEPPeR paper gives the following information about the protein database that the authors searched against:

Data from the Scale Mixes and Variability Mixes were searched against a small protein database consisting of only those proteins that composed the mixtures and common contaminants... Data from the mitochondrial preparations were searched against the International Protein Index (IPI) mouse database version 3.01 and the small database mentioned above.

It proved difficult to replicate this combination of search databases. The paper identifies the spiked proteins by names that generally do not resolve to a single protein sequence in common search tools such as Expasy and Entrez. Below is a link to the list of proteins in the mixture as described in the paper, along with the SwissProt identifiers used in the target search FASTA file:

Pepper SpikedProteins.tsv

As in the PEPPeR study, the total search database consisted of:

  1. the spiked proteins as listed in the table, using SwissProt identifiers
  2. the Mouse IPI FASTA database, using IPI identifiers
  3. the cRAP list of common contaminants from thegpm.org, minus the proteins that overlapped with the spiked proteins (including versions of those spiked proteins from other species). This list used a different format of SwissProt identifiers.
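
The assembly of the three parts can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used: the function names are hypothetical, and it assumes the overlapping contaminant entries have already been identified by hand and are passed in as an explicit list of identifiers.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def record_id(header):
    """Return the identifier: the first whitespace-delimited token of the header."""
    tokens = header.split()
    return tokens[0] if tokens else ""


def build_search_db(spiked, ipi, contaminants, excluded_contaminant_ids):
    """Concatenate the three record lists, dropping excluded contaminants.

    excluded_contaminant_ids is a hand-curated list (an assumption of this
    sketch) of contaminant entries that duplicate a spiked protein,
    including versions of it from other species.
    """
    excluded = set(excluded_contaminant_ids)
    kept = [rec for rec in contaminants if record_id(rec[0]) not in excluded]
    return spiked + ipi + kept


def format_fasta(records, width=60):
    """Serialize records back to FASTA text, wrapping sequences at `width`."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

Each input would be the parsed contents of one source file; the output of `format_fasta` is the text of the combined database to hand to the search engine.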

Using different identifier formats for the three sets of sequences in the search database had the side effect of making it very easy to distinguish expected from unexpected proteins.
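
As a rough illustration of how the identifier format alone can separate the three groups, the sketch below classifies an identifier by pattern. The exact formats are assumptions, since they are not spelled out here: spiked proteins are taken to use bare SwissProt accessions, mouse proteins to use `IPI` followed by digits, and contaminants to use a pipe-delimited `sp|ACCESSION|NAME` form. The function name and category labels are hypothetical.

```python
import re

# Mouse IPI identifiers: "IPI" followed by digits, optionally a version suffix.
IPI_RE = re.compile(r"IPI\d+(\.\d+)?")

# Assumed contaminant format: pipe-delimited SwissProt entry, "sp|ACC|NAME".
PIPE_RE = re.compile(r"sp\|[^|]+\|[^|]+")

# Assumed spiked-protein format: a bare UniProtKB/SwissProt accession.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def classify_identifier(identifier):
    """Return which part of the search database an identifier belongs to."""
    if IPI_RE.fullmatch(identifier):
        return "mouse_ipi"
    if PIPE_RE.fullmatch(identifier):
        return "contaminant"
    if ACCESSION_RE.fullmatch(identifier):
        return "spiked"
    return "unknown"
```

With a helper like this, a list of protein hits could be tallied by category, so that unexpected mouse or contaminant identifications in a spiked-protein sample stand out immediately.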