Before loading any protein annotation files, be aware of the following:
- The protein annotations feature is currently experimental. We are working to improve the performance and reliability of loading protein annotations in future releases. The files can take many hours or days to load, even on high-performing systems.
- This documentation has not been updated to reflect the file paths and user interface of the most recent releases of CPAS.
If you still want to try out this experimental feature, the documentation below may help.
Currently (10/05) the main sources for rich annotations can
be found at EBI (the European Bioinformatics Institute) at:
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz
and
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.xml.gz
uniprot_sprot contains the annotations for the Swiss Protein
database. uniprot_trembl contains the annotations for the
translated EMBL database. (EMBL is a DNA/RNA database.)
The former file is smaller and richer, with many more annotations per entry.
The latter is more inclusive but has far fewer annotations per entry.
Currently the unpacked versions of these files are 1,428,408,665 and
9,933,357,518 bytes, respectively. The files are released fairly often
and keep growing in size.
The primary way to load these files into the database is
by using the MS2 admin console and pressing the <load new annot file>
button (or by going to .../MS2/insertAnnots.post). The files from EBI are
loaded with the "uniprot" file type option.
There is a sample XML file checked in to
.../sampledata/xarfiles/ms2pipe/annotations/Bovine_mini.uniprot.xml
This contains only the annotations associated with the Bovine_mini.fasta file.
The uniprot xml files are parsed and added to the database
using the parseFile method of an org.fhcrc.cpas.protein.XMLProteinLoader object.
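Because the unpacked files run to gigabytes, any tool that inspects them has to stream rather than build the whole tree in memory. The following is a minimal Python sketch of that idea, not the XMLProteinLoader implementation. It assumes each record is an <entry> element with direct <accession> and <name> children; the function names and the sample accessions are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def iter_entries(source):
    """Yield (accession, name) for each <entry>, one entry at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "entry":
            continue
        accession = name = None
        # Look only at direct children; nested <name> elements are ignored.
        for child in elem:
            tag = local_name(child.tag)
            if tag == "accession" and accession is None:
                accession = child.text
            elif tag == "name" and name is None:
                name = child.text
        yield accession, name
        elem.clear()  # release the entry's children once it has been handled


# Invented sample data in the assumed layout.
SAMPLE = b"""<uniprot xmlns="http://uniprot.org/uniprot">
  <entry><accession>P00001</accession><name>ABC1_BOVIN</name></entry>
  <entry><accession>P00002</accession><name>XYZ2_HUMAN</name></entry>
</uniprot>"""

for accession, name in iter_entries(io.BytesIO(SAMPLE)):
    print(accession, name)
```

For a real download, the same function can be handed the object returned by gzip.open(path, "rb"), so the file never needs to be unpacked on disk.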
=====
Besides the annotations proper, there are two dictionaries that are also
loaded into the database. The first is the Swiss Protein organism suffix map, which is loaded
into prot.protSprotOrgMap, and the second is a set of five tables associated
with the GO (Gene Ontology) database.
The first dictionary is derived from a download at NEWT:
http://www.expasy.org/cgi-bin/speclist
Unfortunately, it was derived by hand-editing the HTML on this page. The columns
are sprotsuffix (the Swiss Protein name suffix, described below), superkingdomcode, taxonid,
fullname, genus, species, common name, synonym. All fields are tab delimited. Missing species
are replaced with the string "sp.". This file changes very rarely, and the changes
are not extremely important at Fred Hutch. Currently this file is used only to
guess the genus and species from a Swiss Protein suffix. However, there are many
other potential uses for the file.
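To make the column layout concrete, here is a small Python sketch that reads one line of such a tab-delimited file into a named record. It is illustrative only and is not the loadProtSprotOrgMap code; the names OrgMapRow and parse_org_map_line are hypothetical, and it assumes taxonid is an integer and that trailing empty columns may be missing from a line.

```python
from typing import NamedTuple, Optional


class OrgMapRow(NamedTuple):
    sprot_suffix: str
    superkingdom_code: str
    taxon_id: int
    full_name: str
    genus: str
    species: str
    common_name: Optional[str]
    synonym: Optional[str]


def parse_org_map_line(line):
    """Split one tab-delimited line into the eight documented columns."""
    fields = line.rstrip("\r\n").split("\t")
    fields += [""] * (8 - len(fields))  # pad rows whose trailing columns are absent
    suffix, kingdom, taxon_id, full_name, genus, species, common, synonym = fields[:8]
    return OrgMapRow(
        sprot_suffix=suffix,
        superkingdom_code=kingdom,
        taxon_id=int(taxon_id),      # assumption: taxonid is always numeric
        full_name=full_name,
        genus=genus,
        species=species or "sp.",    # same placeholder the file uses
        common_name=common or None,
        synonym=synonym or None,
    )


row = parse_org_map_line("BOVIN\tE\t9913\tBos taurus\tBos\ttaurus\tBovine\t\n")
print(row.genus, row.species)
```

A dictionary keyed on sprot_suffix built from these rows is enough to turn a name suffix into a genus and species.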
Swiss-Protein names (as opposed to accession strings) consist of 1 to 5 alphanumerics
(alphas are all uppercase) followed by an underscore, followed by a suffix that indicates
the taxon. There are about 14,000 taxa (currently) represented in the file.
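The name format described above can be checked with a short regular expression. This is a hedged Python sketch: the prefix rule (1 to 5 uppercase alphanumerics, then an underscore) comes from the text, while the assumption that the suffix is also made of uppercase alphanumerics is mine, and split_sprot_name is a hypothetical helper name.

```python
import re

# Prefix: 1 to 5 uppercase alphanumerics, then "_", then the taxon suffix.
# Assumption: the suffix itself consists of uppercase alphanumerics.
NAME_RE = re.compile(r"([A-Z0-9]{1,5})_([A-Z0-9]+)")


def split_sprot_name(name):
    """Return (prefix, taxon_suffix), or None if the name does not fit."""
    match = NAME_RE.fullmatch(name)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_sprot_name("ALBU_BOVIN"))
print(split_sprot_name("xyzzy"))
```

The second element of the result is the key to look up in the organism suffix map.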
The ProtSprotOrgMap.txt file is inserted into the database using the
org.fhcrc.cpas.protein.tools.ProteinDictionaryHelpers.loadProtSprotOrgMap(fname)
method. This routine is exposed on the MS2 Admin page as the <Load SWP Org Map>
button. The file is loaded automatically the first time it is required by the
guessOrganism routines. It can also be manually (re)loaded from the MS2 Admin page.
CPAS expects to find ProtSprotOrgMap.txt in the MS2/externalData
directory.
The GO tables are: prot.GoTerm, prot.GoTerm2Term, prot.GoTermDefinition,
prot.GoTermSynonym and (the monstrous) prot.GoGraphPath. These may be
downloaded at:
ftp://ftp.godatabase.org/godatabase/archive/latest-full/go_yyyyMM-termdb-tables.tar.gz
where yyyyMM is, for example, 200510.
This unpacks into a directory called go_yyyyMM-termdb-tables containing 26 .txt files
and 26 .sql files. Of these, we only want term.txt, term2term.txt, term_definition.txt,
term_synonym.txt and graph_path.txt.
These are tab-delimited files that follow the MySQL convention of denoting a NULL field with "\N". The files may be loaded into the database using the
org.fhcrc.cpas.protein.tools.ProteinDictionaryHelpers.loadGo() method, which is
exposed on the MS2 Admin page as the <Load/Reload GO> button. Note that this
routine expects the files to be at ...MS2/externalData/GoTerm.txt, .../GoTerm2Term.txt,
etc., that is, with the same names as the tables. Routines for loading them individually and
from non-default file names are also available in ProteinDictionaryHelpers.
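The "\N" convention is easy to get wrong when reading these files by hand, so here is a minimal Python sketch of it. It is not the loadGo() implementation; read_go_table is a hypothetical name, the rows are invented, and the sketch does not handle backslash-escaped tabs or newlines inside a field.

```python
def read_go_table(lines):
    """Yield each tab-delimited row as a list, mapping the literal \\N to None."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        yield [None if field == "\\N" else field for field in fields]


# Invented rows; the second column of the second row is NULL.
sample = ["1\tfoo\tbar\n", "2\t\\N\tbaz\n"]
for row in read_go_table(sample):
    print(row)
```

Since the function accepts any iterable of lines, an open text file can be passed in directly.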
Note that GoGraphPath is relatively huge: currently 1.3 million records. This is
because it contains the transitive closure of the 3 GO ontology graphs. Thus it
may well grow much faster than the ontologies themselves as they increase
in size.
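A toy example shows why a transitive closure outgrows the graph it is derived from. The Python sketch below is illustrative only: it stores one (ancestor, descendant) pair per reachable pair and ignores any extra columns the real graph_path table may carry, and transitive_closure is a hypothetical helper name.

```python
def transitive_closure(edges):
    """Return every (ancestor, descendant) pair reachable through the edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)

    closure = set()
    for start in children:
        seen = set()
        stack = list(children[start])
        while stack:  # depth-first walk over everything below `start`
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(children.get(node, ()))
    return closure


# A simple chain of 10 terms: 0 -> 1 -> 2 -> ... -> 9
chain = [(i, i + 1) for i in range(9)]
print(len(chain), "direct edges")
print(len(transitive_closure(chain)), "closure rows")
```

The chain has 9 direct edges but 45 closure rows, because every term is paired with every term below it. Deeper or more densely connected ontologies widen that gap further.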
=====
A note on FASTA files. FASTA files can be loaded either from the MS2 Admin page
or from an MS2 run requesting a FASTA file the system has never seen before. When
the system sees a new FASTA file, it downloads the annotations and sequences. It makes
no difference which way it was loaded.
Annotations can be loaded by only one process at a time; otherwise there may be
database conflicts.
Every annotation is associated with an organism and a sequence. Guessing the
organism can be problematic in a FASTA file. Several heuristics are in place
and they work pretty well, but not perfectly. If a FASTA file has a sequence
definition line like:
>xyzzy
you simply can't infer the organism from it. Thus, the FastaDbLoader has two
attributes: DefaultOrganism (a String like "Homo sapiens") and OrganismIsToBeGuessed
(a boolean), accessible through the getters and setters setDefaultOrganism, getDefaultOrganism,
setOrganismToBeGuessed, and isOrganismToBeGuessed. (The current source has a few spelling
errors with respect to this, but they'll be repaired.)
In the comet.def file, a user can specify these parameters, and the cometdefreader will
use them correctly: SHOULDGUESSORGANISM can be 1 or 0, DEFAULTORGANISM is an unquoted
string like: Homo sapiens. Capitalization counts.
IMPORTANT YET-TO-DO:
The MS2Writer does not yet have a way of specifying these. They will probably have
to be inserted into the pepXML standard.
These two fields are exposed on the insertAnnots.post page.
Why is there a "Should Guess Organism?" option? If you know that your FASTA file
is all from Human or Mouse, you can set the DefaultOrganism to "Homo sapiens" or
"Mus musculus" and tell the system not to guess the organism. In this case, it
uses the default. This saves tons of time when you know your FASTA file came from
a single organism. Important caveat, though: don't believe the organism that's
in the name of the FASTA file. The Bovine_mini.fasta file, for example, sounds
like it ought to represent only cows. In reality it contains sequences from about
777 organisms.
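Before turning off organism guessing, it can be worth checking how many distinct taxon suffixes a FASTA file really contains. The following Python sketch is one rough way to do that and is not part of CPAS. It assumes the definition lines carry Swiss Protein style names (for example ALBU_BOVIN) and that suffixes are 1 to 5 uppercase alphanumerics; count_taxon_suffixes and the sample lines are invented for illustration.

```python
import re
from collections import Counter

# Assumption: names look like PREFIX_SUFFIX, each part 1 to 5 uppercase alphanumerics.
SUFFIX_RE = re.compile(r"\b[A-Z0-9]{1,5}_([A-Z0-9]{1,5})\b")


def count_taxon_suffixes(fasta_lines):
    """Count definition lines per taxon suffix; None means no suffix was found."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence data, not a definition line
        match = SUFFIX_RE.search(line)
        counts[match.group(1) if match else None] += 1
    return counts


# Invented sample lines.
sample = [
    ">ALBU_BOVIN Serum albumin",
    "MKWVTF",
    ">TRYP_PIG Trypsin",
    "IVGG",
    ">xyzzy",
    "AAAA",
    ">sp|P00000|CASK_BOVIN Kappa-casein",
]
counts = count_taxon_suffixes(sample)
print(counts)
```

If more than one suffix shows up, or if many lines land under None, setting a single DefaultOrganism and disabling guessing will mislabel some sequences.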