Here's how the fasta parsing and annotation loading work:
Protein sequences are stored in CPAS under a unique constraint on two values: a hash of the sequence itself and the organism id (orgid) of the sequence. So two identical sequences with different orgid values will have two entries in the database.
Protein sequences are added to the database by one of two types of file loads: a fasta file load or a UniprotXML file load. In either case, if a sequence + orgid combination has been loaded previously, the existing entry in the database is used and any additional information about that sequence is associated with that existing entry.
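To make the uniqueness rule concrete, here is a rough Python sketch of the lookup-or-insert behavior described above. It is an illustration only, not the CPAS code: the ProteinStore class, the ensure_sequence name, and the choice of SHA-1 as the hash are all assumptions made for the example.

```python
import hashlib


class ProteinStore:
    """Toy in-memory stand-in for the protein sequence table (hypothetical)."""

    def __init__(self):
        self._by_key = {}      # (sequence hash, orgid) -> seqid
        self.annotations = {}  # seqid -> set of annotation strings
        self._next_seqid = 1

    def ensure_sequence(self, sequence, orgid, annotations=()):
        """Return the seqid for sequence + orgid, creating an entry only if needed."""
        # SHA-1 is an assumption; the actual hash CPAS uses is not stated here.
        key = (hashlib.sha1(sequence.encode("ascii")).hexdigest(), orgid)
        seqid = self._by_key.get(key)
        if seqid is None:
            # First time this sequence + orgid combination has been seen.
            seqid = self._next_seqid
            self._next_seqid += 1
            self._by_key[key] = seqid
            self.annotations[seqid] = set()
        # New information is attached to the new or existing entry.
        self.annotations[seqid].update(annotations)
        return seqid


store = ProteinStore()
a = store.ensure_sequence("MKTAYIAK", orgid=1, annotations=["from fasta"])
b = store.ensure_sequence("MKTAYIAK", orgid=2)
c = store.ensure_sequence("MKTAYIAK", orgid=1, annotations=["from UniprotXML"])
```

In this sketch a and c are the same seqid, because the sequence and orgid match, while b is a separate entry for the same sequence under a different orgid.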
A fasta file can be loaded by either of two paths:
- the fasta file is referenced by a search job and it hasn't been loaded previously (implicit load)
- a site administrator goes to the Admin Console, selects "Protein databases" under Management, and selects "Load New Annot file" from the bottom of the page (explicit load)
In the case of an implicit load, the user has no choice as to how the file is handled. CPAS uses its default behavior, which is to try several strategies for guessing the organism (genus, species) of a sequence entry from the content of the header line. This parsing and guessing has been improved since 2.1, but sometimes the header doesn't contain an organism that CPAS can find or guess. In this case the sequence is loaded under the default organism of "Unknown unknown". Any peptides or members of protein groups that the search engine matches to entries in the fasta file are assigned the seqids of those sequences. This is how your run data can end up referencing sequences that are listed on the protein detail pages as organism "unknown".
For an explicit load of a fasta file, the user can choose whether CPAS should guess the organism, and whether to use a specific genus/species as the default instead of "Unknown unknown". If you know for sure that all of the entries in the fasta file are for a specific organism, you can turn off the guessing and assign all entries to that organism. If you think that most of the entries in the fasta file are from one organism, but some are correctly identified in their headers as belonging to a different organism, then leave the guessing option checked but give it a new default instead of "Unknown unknown".
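The interaction between the guessing option and the default organism can be sketched roughly as follows. This is not the CPAS parser: the assign_organism function and its two patterns (an "OS=Genus species" field and a trailing "[Genus species]" tag, both common fasta header styles) are assumptions chosen for illustration, and the real strategies may differ.

```python
import re

UNKNOWN = ("Unknown", "unknown")

# Hypothetical guessing strategies, tried in order.
_PATTERNS = [
    re.compile(r"\bOS=([A-Z][a-z]+) ([a-z]+)"),   # e.g. "OS=Homo sapiens"
    re.compile(r"\[([A-Z][a-z]+) ([a-z]+)\]"),    # e.g. "[Mus musculus]"
]


def assign_organism(header, guess=True, default=UNKNOWN):
    """Return (genus, species) for one fasta header line."""
    if guess:
        for pattern in _PATTERNS:
            match = pattern.search(header)
            if match:
                return (match.group(1), match.group(2))
    # Guessing is off, or no strategy found an organism: use the default.
    return default


human = ("Homo", "sapiens")

# Implicit load: guessing on, default "Unknown unknown".
guessed = assign_organism(">sp|X1|EXAMPLE Example protein OS=Homo sapiens")
fallback = assign_organism(">gi|123 hypothetical protein")

# Explicit load, guessing off: every entry gets the chosen organism.
forced = assign_organism(">gi|123 protein [Mus musculus]", guess=False, default=human)

# Explicit load, guessing on with a new default.
mixed_hit = assign_organism(">gi|123 protein [Mus musculus]", default=human)
mixed_miss = assign_organism(">gi|123 hypothetical protein", default=human)
```

Here fallback is how entries end up as "Unknown unknown" on an implicit load, while mixed_hit and mixed_miss show the "mostly one organism" case: a header that names its organism keeps it, and everything else falls back to the new default.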
When a UniprotXML file is loaded, the incoming sequences all have their organisms specified, so there is no guessing or default needed. The UniprotXML load will update existing sequences with a new BestName and GeneName, and will add new identifiers and annotations to the new or existing sequence records. If an identifier or annotation for the sequence already exists in the database it is left alone; otherwise this information from Uniprot is added to the protein database. Note that any sequences in the database with an organism value of "Unknown unknown" do not get updated by the UniprotXML file load because they won't exist in the XML.
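As a rough model of the update rules just described, the following Python sketch merges UniprotXML entries into a toy database. Everything in it is hypothetical: the load_uniprot_entries function, the record layout, and the made-up identifiers such as "ID_A". For brevity it keys records on the raw sequence and organism name rather than on a sequence hash and orgid.

```python
def load_uniprot_entries(db, entries):
    """Merge UniprotXML-style entries into db, keyed by (sequence, organism)."""
    for entry in entries:
        key = (entry["sequence"], entry["organism"])
        # Use the existing record if there is one, otherwise create it.
        record = db.setdefault(
            key, {"best_name": None, "gene_name": None, "identifiers": []}
        )
        # BestName and GeneName are overwritten with the incoming values.
        record["best_name"] = entry["best_name"]
        record["gene_name"] = entry["gene_name"]
        # Identifiers already present are left alone; new ones are added.
        for ident in entry["identifiers"]:
            if ident not in record["identifiers"]:
                record["identifiers"].append(ident)


db = {
    ("MKTAYIAK", "Homo sapiens"): {
        "best_name": "old name", "gene_name": None, "identifiers": ["ID_A"],
    },
    ("MKTAYIAK", "Unknown unknown"): {
        "best_name": "gi|123", "gene_name": None, "identifiers": [],
    },
}

load_uniprot_entries(db, [{
    "sequence": "MKTAYIAK",
    "organism": "Homo sapiens",
    "best_name": "Example protein",
    "gene_name": "EXPL",
    "identifiers": ["ID_A", "ID_B"],
}])
```

After the load, the "Homo sapiens" record has the new names and both identifiers, while the "Unknown unknown" record for the same sequence is untouched, because no incoming entry ever carries that organism.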
So that leaves a couple of problems:
- If you have loaded a fasta file that resulted in sequences with "Unknown unknown" organisms and those sequences were associated with peptides or proteins from a run, there is no built-in way to update those runs to point to sequences with the correct organism.
- If you have identifiers and annotations that are no longer valid, there is no built-in way to get rid of them.
I am looking into a couple of solutions for these two problems. I will send you a separate mail so you can try them out.