IPI repeated in Search Engine Comparison

CPAS Forum (Inactive)
IPI repeated in Search Engine Comparison tvaisar  2008-01-28 14:37
Status: Closed
 
Hi Peter,

I am resurrecting a discussion I started back in November about repeated IPIs in Search Engine Comparison. This time I have a slightly different, though related, problem:

I see duplicated IPIs when I compare two datasets searched against different versions of the IPI database.
It is best shown on TRYP_PIG.
Both entries have unique SeqIDs.
However, I see no difference between the two sequences in the original FASTA files; only the header lines differ.
If I display the two sequences in CPAS, I see two identical pages, except that in one case an Organism is assigned. The Description is the same for both, however (some cross talk happened).
I see that we have multiple versions of both databases in the system???

Do you have an explanation for the problem and, if so, is there a way to fix it?

Thanks,

Tomas
 
 
Peter responded:  2008-01-29 20:02
Here's how the fasta parsing and annotation loading work:

Protein sequences are stored in CPAS by a unique constraint based on a hash of the sequence itself and the organism id of the sequence. So two identical sequences with different orgid values will have two entries in the database.

Protein sequences are added to the database by one of two types of file loads: a fasta file load or a UniprotXML file load. In either case if a sequence + orgid combination has been loaded previously, the existing entry in the database is used and any additional information about that sequence is associated with that existing entry.
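
To make the rule above concrete, here is a minimal sketch in Python. It is not the CPAS code: the class name, the use of SHA-1, the orgid values and the sequence fragment are all assumptions made for illustration. It only shows why one sequence can end up with two seqids, and why a repeat load reuses the existing entry.

```python
import hashlib


class ProteinStore:
    """Toy in-memory stand-in for the sequence table (not the real CPAS schema)."""

    def __init__(self):
        self._seqid_by_key = {}  # (sequence hash, orgid) -> seqid
        self._next_seqid = 1

    def get_or_create(self, sequence, orgid):
        # Assumption: SHA-1 is used purely for illustration; the thread only says "a hash".
        key = (hashlib.sha1(sequence.encode("ascii")).hexdigest(), orgid)
        if key not in self._seqid_by_key:
            # First time this sequence + orgid combination is seen: new entry.
            self._seqid_by_key[key] = self._next_seqid
            self._next_seqid += 1
        # Otherwise the existing entry is reused.
        return self._seqid_by_key[key]


store = ProteinStore()
sequence = "IVGGYTCAANSIPYQVSLNS"  # made-up fragment, not a real database entry
ORG_PIG, ORG_UNKNOWN = 1, 2         # hypothetical orgid values

seqid_pig = store.get_or_create(sequence, ORG_PIG)
seqid_unknown = store.get_or_create(sequence, ORG_UNKNOWN)
seqid_pig_again = store.get_or_create(sequence, ORG_PIG)
```

Here seqid_pig and seqid_unknown differ even though the sequence is identical, while seqid_pig_again is the same as seqid_pig.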

A fasta file can be loaded by either of two paths:

  • the fasta file is referenced by a search job and it hasn't been loaded previously (implicit load)
  • a site administrator goes to the Admin Console, selects "Protein databases" under Management, and selects "Load New Annot file" from the bottom of the page (explicit load)
In the case of an implicit load, the user has no choice as to how the file is handled. CPAS uses its default behavior, which is to try several strategies for guessing the organism (genus, species) of a sequence entry from the content of the header line. This parsing and guessing has been improved since 2.1, but sometimes the header doesn't contain an organism that CPAS can find or guess. In this case the sequence is loaded under the default organism of "Unknown unknown". Any peptides or members of protein groups that the search engine determines to be matched to the fasta file entries are assigned those seqids. This is how your run data can end up referencing sequences that are listed on the protein detail pages as organism "unknown".
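
The guess-then-fall-back behavior can be sketched roughly as follows. The two strategies shown (a bracketed "[Genus species]" and a Swiss-Prot style name suffix such as the PIG in TRYP_PIG) are hypothetical examples, not the actual CPAS heuristics, and the lookup table and sample headers are made up for the illustration.

```python
import re

DEFAULT_ORGANISM = "Unknown unknown"

# Hypothetical lookup from Swiss-Prot style name suffixes to genus/species.
SUFFIX_TO_ORGANISM = {
    "PIG": "Sus scrofa",
    "HUMAN": "Homo sapiens",
    "MOUSE": "Mus musculus",
}


def guess_organism(header, default=DEFAULT_ORGANISM):
    """Try a few strategies in turn; fall back to the default if none succeeds."""
    # Strategy 1 (assumed): an explicit "[Genus species]" in the header.
    match = re.search(r"\[([A-Z][a-z]+ [a-z]+)\]", header)
    if match:
        return match.group(1)
    # Strategy 2 (assumed): an entry name such as TRYP_PIG with a known suffix.
    for match in re.finditer(r"\b[A-Z0-9]+_([A-Z0-9]+)\b", header):
        organism = SUFFIX_TO_ORGANISM.get(match.group(1))
        if organism:
            return organism
    # Nothing found: the sequence would be filed under the default organism.
    return default


from_suffix = guess_organism(">IPI:IPI00000000.1 TRYP_PIG Trypsin precursor")
from_brackets = guess_organism(">gi|000000 trypsin [Sus scrofa]")
no_organism = guess_organism(">IPI:IPI00000000.1 hypothetical protein")
```

With these made-up headers, the first two calls resolve to "Sus scrofa" and the third falls back to "Unknown unknown".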

For an explicit load of a fasta file, the user has a choice of whether they want to have CPAS guess the organism, and whether they want to use a specific genus/species as the default instead of "Unknown unknown". If you know for sure that all of the entries in the fasta file are for a specific organism, you can turn off the guessing and assign all entries to a specific organism. If you think that most of the entries in the fasta file are from one organism, but some are correctly identified in their headers as belonging to a different organism, then leave the guessing option checked but give it a new default instead of "Unknown unknown".
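
The two explicit-load choices combine as in the sketch below. The function and parameter names (assign_organism, use_guessing, default_organism) are invented for this illustration and do not correspond to CPAS settings, and toy_guess is only a stand-in for the real guessing logic.

```python
DEFAULT_ORGANISM = "Unknown unknown"


def toy_guess(header):
    """Stand-in guesser: returns the text inside [...] if present, else None."""
    start, end = header.find("["), header.find("]")
    return header[start + 1:end] if 0 <= start < end else None


def assign_organism(header, guess, use_guessing=True,
                    default_organism=DEFAULT_ORGANISM):
    """Pick the organism for one fasta entry under the chosen load options."""
    if use_guessing:
        guessed = guess(header)
        if guessed is not None:
            return guessed
    # Guessing is switched off, or it found nothing: use the default.
    return default_organism


plain = ">IPI:IPI00000000.1 hypothetical protein"
labeled = ">IPI:IPI00000000.1 some protein [Homo sapiens]"

# Implicit load: guessing on, default left at "Unknown unknown".
implicit = assign_organism(plain, toy_guess)
# Explicit load, guessing off: every entry gets the chosen organism.
forced = assign_organism(labeled, toy_guess, use_guessing=False,
                         default_organism="Sus scrofa")
# Explicit load, guessing on with a new default.
kept = assign_organism(labeled, toy_guess, default_organism="Sus scrofa")
filled = assign_organism(plain, toy_guess, default_organism="Sus scrofa")
```

In this sketch, implicit is "Unknown unknown", forced and filled are "Sus scrofa", and kept stays "Homo sapiens" because the header identifies a different organism.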

When a UniprotXML file is loaded, the incoming sequences all have their organisms specified, so there is no guessing or default needed. The UniprotXML load will update existing sequences with a new BestName and GeneName, and will add new identifiers and annotations to the new or existing sequence records. If an identifier or annotation for the sequence already exists in the database it is left alone; otherwise this information from Uniprot is added to the protein database. Note that any sequences in the database with an organism value of "Unknown unknown" do not get updated by the UniprotXML file load because they won't exist in the XML.
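
The update rules in the previous paragraph can be illustrated with a small sketch. The dictionary layout, the function name and the placeholder accessions and gene name are assumptions for the example, not the CPAS data model.

```python
UNKNOWN = "Unknown unknown"


def new_record():
    return {"best_name": None, "gene_name": None, "identifiers": set()}


def apply_uniprot_entry(db, entry):
    """Merge one UniprotXML entry into a toy database keyed by (sequence, organism)."""
    key = (entry["sequence"], entry["organism"])
    record = db.setdefault(key, new_record())
    # BestName and GeneName are overwritten by the Uniprot values.
    record["best_name"] = entry["best_name"]
    record["gene_name"] = entry["gene_name"]
    # Set union leaves existing identifiers alone and adds only the new ones.
    record["identifiers"] |= set(entry["identifiers"])
    return record


seq = "IVGGYTCAANSIPYQVSLNS"  # made-up fragment
db = {
    (seq, "Sus scrofa"): {"best_name": "IPI00000000", "gene_name": None,
                          "identifiers": {("IPI", "IPI00000000")}},
    (seq, UNKNOWN): {"best_name": "IPI00000000", "gene_name": None,
                     "identifiers": {("IPI", "IPI00000000")}},
}

apply_uniprot_entry(db, {
    "sequence": seq,
    "organism": "Sus scrofa",               # UniprotXML always names the organism
    "best_name": "TRYP_PIG",
    "gene_name": "GENE1",                   # placeholder gene name
    "identifiers": [("SwissProt", "P00000")],  # placeholder accession
})

updated = db[(seq, "Sus scrofa")]
untouched = db[(seq, UNKNOWN)]
```

After the load, updated carries the new BestName plus both identifiers, while untouched still has its old BestName because no XML entry ever carries the "Unknown unknown" organism.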

So that leaves a couple of problems:

  1. If you have loaded a fasta file that resulted in sequences with "Unknown unknown" organisms and those sequences were associated with peptides or proteins from a run, there is not a built-in way to update those runs to point to sequences with the correct organism.
  2. If you have identifiers and annotations that are no longer valid, there is no built-in way to get rid of them.
I am looking into a couple of solutions for these two problems. I will send you a separate mail so you can try them out.
 
tvaisar responded:  2008-01-30 07:59
Hi Peter,

Thanks so much for the detailed explanation of the database's inner workings. Now I understand the issues. Since the data is an external upload of Sequest search data, not a CPAS search, I guess I could delete the data, delete the database that was loaded without an orgid (Unknown unknown), reload the database from the Load Annot File, and then reload the data.

Related to this is a problem we encountered recently on the production system, where Rich uploaded Uniprot and Trembl and, as you describe above, the BestName changed to the Uniprot ID where available. The problem is that some proteins (IPIs) have no Uniprot entry, so the BestName for a sample run is a mix of various kinds of IDs (Uniprot/SwPr, Uniprot/Trembl and IPI). This makes a mess of further data analysis and of cross-referencing to other IDs (ENTREZ, Gene Name, RefSeq etc.). Would there be a way to specify which ID you want the protein to be referenced by (e.g. only IPI), while still keeping it associated with the Uniprot annotation if available? I guess this will be possible in Query beta. Currently we use the Search Engine Comparison because spectral counting works correctly there, and in it the BestName is what you get as the protein ID.

Thanks,

Tomas

 
Peter responded:  2008-01-30 18:11
We've thought about how to allow user-specified updates to the bestName. It's not that great because it forces changes on all site users (there can be only one best name), but it might be a good solution for your lab. I just sent you some SQL scripts for setting it, which I think will do the job for you as an interim measure until this functionality gets into the product itself.

We still need a good solution for loading the IPI full annotation files, but that's not a quick thing.