Loading Protein Annotation Files

1.7
Before loading any protein annotation files, be aware of the following:
  • The protein annotations feature is currently experimental. We are working to improve the performance and reliability of loading protein annotations in future releases. The files can take many hours or days to load, even on high-performance systems.
  • This documentation has not been updated to reflect the file paths and user interface of the most recent releases of CPAS.
If you still want to try out this experimental feature, the documentation below may help.



Currently (10/05) the main sources for rich annotations can be found at EBI (the European Bioinformatics Institute) at:

ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz

and

ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.xml.gz

uniprot_sprot contains the annotations for the Swiss-Prot (Swiss Protein) database. uniprot_trembl contains the annotations for TrEMBL, the translated EMBL database (EMBL is a DNA/RNA database). The former file is smaller but richer, with many more annotations per entry. The latter is more inclusive, with far fewer annotations per entry.

Currently the unpacked versions of these files are 1,428,408,665 and 9,933,357,518 bytes, respectively. New versions are released fairly often, and the files keep growing substantially.

The primary way to load these files into the database is to use the MS2 admin console and press the <load new annot file> button (or go to .../MS2/insertAnnots.post). The files from EBI are loaded with the "uniprot" file type option.

There is a sample XML file checked in to

.../sampledata/xarfiles/ms2pipe/annotations/Bovine_mini.uniprot.xml

This contains only the annotations associated with the Bovine_mini.fasta file.

The uniprot XML files are parsed and added to the database by the parseFile method of an org.fhcrc.cpas.protein.XMLProteinLoader object.
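Because the unpacked files run to gigabytes, any parser has to stream them rather than build the whole tree in memory. The following is a minimal Python sketch of that idea, not the XMLProteinLoader implementation. It assumes the UniProt XML namespace shown in the code and the entry/accession/organism element layout; the function name iter_entries and the SAMPLE data are made up for illustration.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Assumption: the namespace declared on the root element of the UniProt XML
# file. Check your own file and adjust if it differs.
UNIPROT_NS = "{http://uniprot.org/uniprot}"

# Hypothetical two-entry document, used only to exercise the sketch.
SAMPLE = b"""<?xml version="1.0"?>
<uniprot xmlns="http://uniprot.org/uniprot">
<entry><accession>P00001</accession>
<organism><name type="scientific">Bos taurus</name>
<name type="common">Bovine</name></organism></entry>
<entry><accession>P00002</accession>
<organism><name type="common">Mouse</name></organism></entry>
</uniprot>"""


def iter_entries(stream):
    """Yield (accession, scientific_name) for each <entry>, one at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != UNIPROT_NS + "entry":
            continue
        accession = elem.findtext(UNIPROT_NS + "accession")
        organism = None
        for name in elem.iterfind(f"{UNIPROT_NS}organism/{UNIPROT_NS}name"):
            if name.get("type") == "scientific":
                organism = name.text
                break
        yield accession, organism
        # Drop the entry's children so finished entries do not pile up.
        elem.clear()


def iter_entries_from_gz(path):
    """Stream entries straight from a .xml.gz file without unpacking it."""
    with gzip.open(path, "rb") as stream:
        yield from iter_entries(stream)


if __name__ == "__main__":
    for accession, organism in iter_entries(io.BytesIO(SAMPLE)):
        print(accession, organism)
```

Counting entries this way before starting a real load gives a rough idea of how much work the loader has ahead of it.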

=====

Besides the annotations proper, two dictionaries are also loaded into the database. The first is the Swiss Protein organism suffix map, which is loaded into prot.protSprotOrgMap; the second is a set of five tables associated with the GO (Gene Ontology) database.

The first table is derived from a download at NEWT:

http://www.expasy.org/cgi-bin/speclist

Unfortunately, it was derived by hand-editing the HTML on this page. The columns are sprotsuffix (the Swiss Protein name suffix, described further below), superkingdomcode, taxonid, fullname, genus, species, common name and synonym. All fields are tab delimited. Missing species are replaced with the string "sp.". This file changes very rarely, and the changes are not extremely important at Fred Hutch. Currently this file is used only to guess the genus and species from a Swiss Protein suffix. However, there are many other potential uses for the file.
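To make the column layout concrete, here is a small Python sketch that reads such a tab-delimited file into a lookup keyed by suffix. It is not the CPAS loader. The names OrgMapRow and read_org_map are hypothetical, and the sketch assumes the file has no header row and that every row carries a numeric taxonid.

```python
import csv
import io
from typing import NamedTuple


class OrgMapRow(NamedTuple):
    """One row, in the column order listed in the text."""
    sprot_suffix: str
    superkingdom_code: str
    taxon_id: int
    full_name: str
    genus: str
    species: str
    common_name: str
    synonym: str


def read_org_map(stream):
    """Return {suffix: OrgMapRow} from a tab-delimited, 8-column stream."""
    rows = {}
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # Pad short rows so missing trailing columns do not raise.
        fields = (fields + [""] * 8)[:8]
        suffix, kingdom, taxon, full, genus, species, common, synonym = fields
        rows[suffix] = OrgMapRow(
            suffix, kingdom, int(taxon), full, genus,
            species or "sp.",  # same placeholder the file uses
            common, synonym,
        )
    return rows


if __name__ == "__main__":
    # Illustrative rows only; the real file has about 14,000 of them.
    text = (
        "BOVIN\tE\t9913\tBos taurus\tBos\ttaurus\tBovine\t\n"
        "HUMAN\tE\t9606\tHomo sapiens\tHomo\tsapiens\tHuman\t\n"
    )
    org_map = read_org_map(io.StringIO(text))
    print(org_map["BOVIN"].genus, org_map["BOVIN"].species)
```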

Swiss-Protein names (as opposed to accession strings) consist of 1 to 5 alphanumerics (alphas are all uppercase) followed by an underscore, followed by a suffix that indicates the taxon. There are about 14,000 taxa (currently) represented in the file.
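The naming rule above can be written as a regular expression. This is a sketch only: the text specifies the part before the underscore, and the pattern additionally assumes that the taxon suffix is 1 to 5 uppercase alphanumerics. The function name split_sprot_name is hypothetical.

```python
import re

# 1 to 5 uppercase alphanumerics, an underscore, then the taxon suffix.
# Assumption: the suffix is itself 1 to 5 uppercase alphanumerics.
SPROT_NAME = re.compile(r"([A-Z0-9]{1,5})_([A-Z0-9]{1,5})")


def split_sprot_name(name):
    """Return (protein_part, taxon_suffix), or None if the name does not match."""
    match = SPROT_NAME.fullmatch(name)
    return match.groups() if match else None


if __name__ == "__main__":
    print(split_sprot_name("ALBU_BOVIN"))  # a name: splits at the underscore
    print(split_sprot_name("P02769"))      # an accession string: no match
```

The suffix returned here is the key that would be looked up in the organism suffix map.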

The ProtSprotOrgMap.txt file is inserted into the database using the org.fhcrc.cpas.protein.tools.ProteinDictionaryHelpers.loadProtSprotOrgMap(fname) method. This routine is exposed on the MS2 Admin page in the <Load SWP Org Map> button. It is loaded the first time it is required by the guessOrganism routines. It can also be manually (re)loaded from the MS2 admin page.

CPAS expects to see ProtSprotOrgMap.txt in the MS2/externalData directory.

The GO tables are: prot.GoTerm, prot.GoTerm2Term, prot.GoTermDefinition, prot.GoTermSynonym and (the monstrous) prot.GoGraphPath. These may be downloaded at:

ftp://ftp.godatabase.org/godatabase/archive/latest-full/go_yyyyMM-termdb-tables.tar.gz

Where yyyyMM is, for example, 200510.

This unpacks into a directory called go_yyyyMM-termdb-tables containing 26 .txt files and 26 .sql files. Of these, we only want term.txt, term2term.txt, term_definition.txt, term_synonym.txt and graph_path.txt.

These are tab-delimited files that follow the MySQL convention of denoting a NULL field with "\N". The files may be loaded into the database using the org.fhcrc.cpas.protein.tools.ProteinDictionaryHelpers.loadGo() method, which is exposed on the MS2 Admin page as the <Load/Reload GO> button. Note that this routine expects the files to be at ...MS2/externalData/GoTerm.txt, .../GoTerm2Term.txt, etc., that is, named after the tables. Routines for loading them individually and from non-default file names are also available in ProteinDictionaryHelpers.
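The "\N" convention is easy to get wrong when inspecting these files by hand, so here is a minimal Python sketch of reading one line. It is not what loadGo() does internally. The function names are hypothetical, the sample row is made up, and the sketch deliberately ignores the other backslash escapes that MySQL dump files can contain.

```python
import io


def parse_dump_line(line):
    """Split one tab-delimited dump line, mapping the \\N marker to None."""
    fields = line.rstrip("\r\n").split("\t")
    return [None if field == r"\N" else field for field in fields]


def read_dump(stream):
    """Yield one list of fields per non-blank line of a dump file."""
    for line in stream:
        if line.strip():
            yield parse_dump_line(line)


if __name__ == "__main__":
    # Hypothetical row; the third field is NULL.
    sample = io.StringIO("1\tsome term name\t\\N\t0\n")
    for row in read_dump(sample):
        print(row)
```

Note that an empty string between two tabs stays an empty string; only the literal two-character marker becomes None.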

Note that GoGraphPath is relatively huge: currently 1.3 million records. This is because it contains the transitive closure of the 3 GO ontology graphs. Thus it may well grow much faster than the ontologies themselves as they increase in size.
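To see why a transitive closure outgrows the graph it is built from, consider the toy Python sketch below. It computes the reachable (ancestor, descendant) pairs of a small graph and compares that count with the number of direct edges. The chain-shaped "ontology" is an artificial worst case chosen for illustration, and the sketch counts pairs only; it makes no claim about the exact rows stored in prot.GoGraphPath.

```python
def transitive_closure(edges):
    """Return the set of (ancestor, descendant) pairs reachable via edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)

    closure = set()
    for start in children:
        # Depth-first walk from each node that has at least one child.
        stack = list(children[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(children.get(node, ()))
    return closure


def chain(n):
    """Edges of a linear toy ontology: 0 -> 1 -> ... -> n-1."""
    return [(i, i + 1) for i in range(n - 1)]


if __name__ == "__main__":
    for n in (10, 100):
        edges = chain(n)
        print(n, "terms:", len(edges), "edges,",
              len(transitive_closure(edges)), "closure pairs")
```

For a chain of n terms there are n - 1 direct edges but n(n - 1)/2 closure pairs, so ten times as many terms means roughly a hundred times as many pairs.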

=====

A note on FASTA files: they can be loaded either from the MS2 Admin page or by an MS2 run that requests a FASTA file the system has never seen before. When the system sees a new FASTA file, it downloads the annotations and sequences. It makes no difference which way the file was loaded.

Annotations can be loaded by only one process at a time; otherwise there may be database conflicts.

Every annotation is associated with an organism and a sequence. Guessing the organism can be problematic in a FASTA file. Several heuristics are in place and they work pretty well, but not perfectly. If a FASTA file has a sequence definition line like:

>xyzzy

You simply can't infer the organism from it. Thus, the FastaDbLoader has two attributes, DefaultOrganism (a String like "Homo sapiens") and OrganismIsToBeGuessed (a boolean), accessible through the getters and setters setDefaultOrganism, getDefaultOrganism, setOrganismToBeGuessed and isOrganismToBeGuessed. (The current source has a few spelling errors with respect to this, but they'll be repaired.)

In the comet.def file, a user can specify these parameters, and the cometdefreader will use them correctly: SHOULDGUESSORGANISM can be 1 or 0, and DEFAULTORGANISM is an unquoted string such as Homo sapiens. Capitalization counts.

IMPORTANT YET-TO-DO:

The MS2Writer does not yet have a way of specifying these. They will probably have to be inserted into the pepXML standard.

These two fields are exposed on the insertAnnots.post page.

Why is there a "Should Guess Organism?" option? If you know that your FASTA file is all from Human or Mouse, you can set the DefaultOrganism to "Homo sapiens" or "Mus musculus" and tell the system not to guess the organism. In this case, it uses the default. This saves tons of time when you know your FASTA file came from a single organism. Important caveat, though: don't believe the organism that's in the name of the FASTA file. The Bovine_Mini.fasta file, for example, sounds like it just ought to represent cows. In reality it contains sequences from about 777 organisms.
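The interaction between the two settings can be sketched in a few lines of Python. This is an illustration of the decision described above, not the FastaDbLoader code. The function choose_organism, the single suffix-based heuristic and the fallback to the default when guessing fails are all assumptions made for the example; the actual CPAS heuristics are not described here.

```python
import re


def choose_organism(defline, default_organism, should_guess, suffix_map):
    """Pick an organism for one FASTA definition line.

    defline          -- the header line, e.g. ">sp|P02769|ALBU_BOVIN ..."
    default_organism -- used when guessing is off, or when the guess fails
    should_guess     -- False means: trust the default and skip the heuristics
    suffix_map       -- {Swiss Protein suffix: organism name}
    """
    if not should_guess:
        return default_organism  # the fast path for single-organism files

    # Hypothetical heuristic: look for a NAME_SUFFIX token in the header.
    header = defline.lstrip(">")
    for token in re.split(r"[|\s]+", header):
        _name, sep, suffix = token.partition("_")
        if sep and suffix in suffix_map:
            return suffix_map[suffix]

    # Assumption: fall back to the default when nothing can be inferred.
    return default_organism


if __name__ == "__main__":
    suffixes = {"BOVIN": "Bos taurus"}
    print(choose_organism(">sp|P02769|ALBU_BOVIN Serum albumin",
                          "Homo sapiens", True, suffixes))
    print(choose_organism(">xyzzy", "Homo sapiens", True, suffixes))
```

With guessing switched off, every sequence gets the default organism without any per-line work, which is where the time saving comes from.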

