More details about each public protein annotation database type are listed below.
LabKey ships with a version of the UniProt organism suffix map and loads it automatically the first time it is required by the guess organism routines. It can also be manually (re)loaded from the MS2 admin page; however, this is not something LabKey administrators or users need to do. The underlying data change very rarely and the changes are not very important to LabKey Server. Currently, this dictionary is used to guess the genus and species from a suffix (though there are other potential uses for this data).
The rest of this section provides technical details about the creation, format, and loading of the SProtOrgMap.txt file.
The file is derived from the UniProt Controlled Vocabulary of Species list:
http://www.uniprot.org/docs/speclist
The HTML from this page was hand edited to generate the file. The columns are sprotsuffix (Swiss-Prot name suffix), superkingdomcode, taxonid, fullname, genus, species, common name, and synonym. All fields are tab-delimited. A missing species is replaced with the string "sp.". Swiss-Prot names (as opposed to accession strings) consist of 1 to 5 uppercase alphanumeric characters, followed by an underscore and a suffix for the taxon. The file currently represents about 14,000 taxa.
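As a rough illustration of the format described above, the following Python sketch splits a Swiss-Prot style name into its prefix and taxon suffix, and parses one tab-delimited row using the column order listed. The function names, the column labels, and the sample row in the usage note are made up for illustration; this is not LabKey's loader code.

```python
import re

# Column order as listed above; these labels are illustrative only.
COLUMNS = ["sprotsuffix", "superkingdomcode", "taxonid", "fullname",
           "genus", "species", "commonname", "synonym"]

# 1 to 5 uppercase alphanumerics, an underscore, then the taxon suffix.
NAME_PATTERN = re.compile(r"^([A-Z0-9]{1,5})_([A-Z0-9]+)$")


def split_sprot_name(name):
    """Return (prefix, suffix), or None if the name does not fit the pattern."""
    match = NAME_PATTERN.match(name)
    return match.groups() if match else None


def parse_org_map_line(line):
    """Parse one tab-delimited row into a dict keyed by COLUMNS."""
    fields = line.rstrip("\n").split("\t")
    # Pad short rows so missing trailing columns become empty strings.
    fields += [""] * (len(COLUMNS) - len(fields))
    return dict(zip(COLUMNS, fields))
```

For example, split_sprot_name("ALBU_HUMAN") yields the suffix "HUMAN", which could then be looked up in rows parsed with parse_org_map_line to find the genus and species.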
The file can be (re)loaded by visiting the Admin Console -> Protein Databases and clicking the "Reload SWP Org Map" button. LabKey will then load the file named ProtSprotOrgMap.txt in the MS2/externalData directory. The file is inserted into the database (prot.SprotOrgMap table) using the ProteinDictionaryHelpers.loadProtSprotOrgMap(fname) method.
LabKey loads five tables associated with the GO (Gene Ontology) database to provide details about cellular locations, molecular functions, and metabolic processes associated with proteins found in samples. If these files are loaded, a "GO Piechart" button will appear below filtered MS2 results, allowing you to generate GO charts based on the sequences in your results.
The GO databases are large (currently about 10 megabytes) and change on a monthly basis. Thus, a LabKey administrator must load them and should update them periodically. This is a simple, fast process.
To load the most recent GO database, go to Admin > Site > Admin Console, click Protein Databases and click the Load / Reload Gene Ontology Data button. LabKey Server will automatically download the latest GO data file, clear any existing GO data from your database, and upload new versions of all tables. On a modern server with a reasonably fast Internet connection, this whole process takes about three minutes. Your server must be able to connect directly to the FTP site listed below.
Linking results to GO information also requires loading a UniProt or TrEMBL file (see below).
The rest of this section provides technical details about the retrieval, format, and loading of GO database files.
LabKey downloads the GO database file from: ftp://ftp.geneontology.org/godatabase/archive/latest-full
The file has the form go_yyyyMM-termdb-tables.tar.gz, where yyyyMM is, for example, 201205. LabKey unpacks this file and loads the five files it needs (graph_path, term.txt, term2term.txt, term_definition, and term_synonym) into five database tables (prot.GoGraphPath, prot.GoTerm, prot.GoTerm2Term, prot.GoTermDefinition, and prot.GoTermSynonym). The files are tab-delimited and follow the MySQL convention of denoting a NULL field with "\N". The files are loaded into the database using the FtpGoLoader class.
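The file naming pattern and the NULL convention can be sketched in a few lines of Python. This is only an illustration of the format described above, not the FtpGoLoader implementation; the function names are hypothetical and the sample values in the usage note are invented.

```python
import re

# Matches names of the form go_yyyyMM-termdb-tables.tar.gz.
GO_ARCHIVE_PATTERN = re.compile(r"^go_(\d{4})(\d{2})-termdb-tables\.tar\.gz$")

# The two-character marker (backslash, capital N) that stands for NULL.
NULL_MARKER = r"\N"


def archive_year_month(filename):
    """Return (year, month) from an archive name, or None if it does not match."""
    match = GO_ARCHIVE_PATTERN.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_go_row(line):
    """Split a tab-delimited row, mapping the NULL marker to None."""
    return [None if field == NULL_MARKER else field
            for field in line.rstrip("\n").split("\t")]
```

For instance, archive_year_month("go_201205-termdb-tables.tar.gz") gives (2012, 5), and a row whose third field is the NULL marker comes back with None in that position.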
Note that GoGraphPath is relatively large (currently 1.9 million records) because it contains the transitive closure of the three GO ontology graphs. It will grow considerably faster than the ontologies themselves as they increase in size.
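To see why a transitive closure table is so much larger than the graph it is derived from, consider this small Python sketch. It works on a toy list of parent/child edges, not on GO's actual table layout, and the function name is made up for illustration.

```python
def transitive_closure(edges):
    """Return all (ancestor, descendant) pairs reachable via one or more edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)

    closure = set()
    for start in children:
        # Depth-first walk from each node that has at least one child.
        stack = list(children[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(children.get(node, ()))
    return closure
```

A simple chain a -> b -> c -> d has three edges but six ancestor/descendant pairs, and the gap widens as graphs get deeper and more interconnected.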
Java 7 has known issues with FTP and the Windows firewall. Administrators must manually configure their firewall in order to use certain FTP commands. Not doing this will prevent LabKey from automatically loading GO annotations. To work around this problem, use the manual download option or configure your firewall as suggested in these links:
Note that loading these files is functional and reasonably well tested, but due to the immense size of the files, it can take many hours or days to load them even on high-performing systems. When funding becomes available, we plan to improve the performance of loading these files.
The main source for rich annotations is the EBI (the European Bioinformatics Institute) at:
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete
The two files of interest are:
To load these files:
.../sampledata/xarfiles/ms2pipe/annotations/Bovine_mini.uniprot.xml
This contains only the annotations associated with Bovine_mini.fasta file.
The UniProt XML files are parsed and added to the database using the XMLProteinLoader.parseFile() method.
When LabKey loads results that were searched against a new FASTA file, it loads the FASTA file, including all sequences and any annotations that can be parsed from the FASTA header line. Every annotation is associated with an organism and a sequence. Guessing the organism can be problematic in a FASTA file. Several heuristics are in place and work fairly well, but not perfectly. Consider a FASTA file with a sequence definition line such as:
>xyzzy
You cannot infer the organism from it. Thus, the FastaDbLoader has two attributes: DefaultOrganism (a String like "Homo sapiens") and OrganismIsToBeGuessed (a boolean), accessible through the getters and setters setDefaultOrganism, getDefaultOrganism, setOrganismToBeGuessed, and isOrganismToBeGuessed. These two fields are exposed on the insertAnnots.post page.
Why is there a "Should Guess Organism?" option? If you know that your FASTA file comes from Human or Mouse samples, you can set the DefaultOrganism to "Homo sapiens" or "Mus musculus" and tell the system not to guess the organism. In this case, it uses the default. This saves tons of time when you know your FASTA file came from a single organism.
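The interaction between the default organism and the guessing option can be sketched roughly as follows. This Python snippet is a simplified illustration, not the FastaDbLoader logic: the function name is hypothetical, it tries only a single Swiss-Prot suffix lookup, and falling back to the default when the guess fails is an assumption.

```python
def resolve_organism(header, default_organism, should_guess, suffix_map):
    """Pick an organism name for one FASTA definition line.

    header: the definition line without the leading '>'.
    suffix_map: dict from Swiss-Prot suffix (e.g. "HUMAN") to "Genus species".
    """
    if not should_guess:
        # Guessing is turned off: always use the default.
        return default_organism

    # Look at the first whitespace-delimited token, e.g. "ALBU_HUMAN".
    tokens = header.split()
    token = tokens[0] if tokens else ""
    _, separator, suffix = token.rpartition("_")
    if separator and suffix in suffix_map:
        return suffix_map[suffix]

    # Assumption: fall back to the default when no guess can be made.
    return default_organism
```

With guessing turned off, the function returns immediately without inspecting the header, which is the time saving described above. A header such as "xyzzy" has no suffix to look up, so it resolves to the default either way.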
Important caveat: Do not assume that the organism used in the name of the FASTA file is correct. The Bovine_mini.fasta file, for example, sounds like it contains data from cows alone. In reality, it contains sequences from about 777 organisms.