Loading Public Protein Annotation Files: /Documentation/Archive/17.2

Loading Public Protein Annotation Files

LabKey can load data from many types of public databases of protein annotations. It can then link loaded MS2 results to the rich, biologically-interesting information in these knowledge bases.

UniProtKB Species Suffix Map. Used to determine the genus and species of a protein sequence from the suffix of its Swiss-Prot entry name.
The Gene Ontology (GO) database. Provides the cellular locations, molecular functions, and metabolic processes of protein sequences.
UniProtKB (SwissProt and TrEMBL). Provide extensively curated protein information, including function, classification, and cross-references.
FASTA. A plain-text file format for protein or DNA sequences. LabKey loads the sequences in a FASTA file along with any annotations it can parse from the header lines.

In addition to the public databases, you can create custom protein lists with your own annotations. More information can be found on the Using Custom Protein Annotations page.

More details about each public protein annotation database type are listed below.

UniProtKB Species Suffix Map

LabKey ships with a version of the UniProt organism suffix map and loads it automatically the first time the organism-guessing routines need it. It can also be (re)loaded manually from the MS2 admin page, but LabKey administrators and users should not normally need to do this: the underlying data change very rarely, and the changes matter little to LabKey Server. Currently, this dictionary is used to guess the genus and species from a suffix (though there are other potential uses for this data).

The rest of this section provides technical details about the creation, format, and loading of the ProtSprotOrgMap.txt file.

The file is derived from the UniProt Controlled Vocabulary of Species list:

http://www.uniprot.org/docs/speclist

The HTML from this page was hand-edited to generate the file. The columns are sprotsuffix (Swiss-Prot name suffix), superkingdomcode, taxonid, fullname, genus, species, common name, and synonym. All fields are tab-delimited. Missing species are replaced with the string "sp.". Swiss-Prot names (as opposed to accession strings) consist of 1 to 5 uppercase alphanumerics, followed by an underscore and a suffix for the taxon. There are about 14,000 taxa represented in the file at present.
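The file layout and name convention described above can be sketched in a few lines of Python. This is only an illustration, not the code LabKey uses: the function names, the sample row, and the assumption that the taxon suffix consists of uppercase alphanumerics are all made up for the example.

```python
import re
from typing import Dict, Optional

# Column order as described above for the organism map file.
COLUMNS = ("sprotsuffix", "superkingdomcode", "taxonid", "fullname",
           "genus", "species", "commonname", "synonym")

# 1 to 5 uppercase alphanumerics, an underscore, then the taxon suffix.
# Assumption: the suffix itself is made of uppercase alphanumerics.
SWISS_NAME = re.compile(r"^([A-Z0-9]{1,5})_([A-Z0-9]+)$")


def parse_map_line(line: str) -> Dict[str, str]:
    """Split one tab-delimited row into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    # Pad short rows so every column is present.
    fields += [""] * (len(COLUMNS) - len(fields))
    row = dict(zip(COLUMNS, fields))
    # Missing species are represented by the string "sp.".
    if not row["species"]:
        row["species"] = "sp."
    return row


def taxon_suffix(name: str) -> Optional[str]:
    """Return the taxon suffix of a Swiss-Prot style name, or None."""
    match = SWISS_NAME.match(name)
    return match.group(2) if match else None


# Illustrative row; the exact contents of the shipped file may differ.
row = parse_map_line("BOVIN\tE\t9913\tBos taurus\tBos\ttaurus\tBovine\t\n")
print(row["genus"], row["species"])
print(taxon_suffix("ALBU_BOVIN"))
```

An accession string such as P02769 has no underscore, so taxon_suffix returns None for it rather than a guess.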

The file can be (re)loaded by visiting the Admin Console -> Protein Databases and clicking the "Reload SWP Org Map" button. LabKey will then load the file named ProtSprotOrgMap.txt in the MS2/externalData directory. The file is inserted into the database (prot.SprotOrgMap table) using the ProteinDictionaryHelpers.loadProtSprotOrgMap(fname) method.

Gene Ontology (GO) Database

LabKey loads five tables associated with the GO (Gene Ontology) database to provide details about cellular locations, molecular functions, and metabolic processes associated with proteins found in samples. If these files are loaded, a "GO Piechart" button will appear below filtered MS2 results, allowing you to generate GO charts based on the sequences in your results.

The GO databases are large (currently about 10 megabytes) and change on a monthly basis. Thus, a LabKey administrator must load them and should update them periodically. This is a simple, fast process.

To load the most recent GO database, go to Admin > Site > Admin Console, click Protein Databases and click the Load / Reload Gene Ontology Data button. LabKey Server will automatically download the latest GO data file, clear any existing GO data from your database, and upload new versions of all tables. On a modern server with a reasonably fast Internet connection, this whole process takes about three minutes. Your server must be able to connect directly to the FTP site listed below.

Linking results to GO information requires loading a UniProt (SwissProt or TrEMBL) file as well (see below).

The rest of this section provides technical details about the retrieval, format, and loading of GO database files.

LabKey downloads the GO database file from: ftp://ftp.geneontology.org/godatabase/archive/latest-full

The file has the form go_yyyyMM-termdb-tables.tar.gz, where yyyyMM is, for example, 201205. LabKey unpacks this file and loads the five files it needs (graph_path.txt, term.txt, term2term.txt, term_definition.txt, and term_synonym.txt) into five database tables (prot.GoGraphPath, prot.GoTerm, prot.GoTerm2Term, prot.GoTermDefinition, and prot.GoTermSynonym). The files are tab-delimited and follow the MySQL convention of denoting a NULL field with "\N". The files are loaded into the database using the FtpGoLoader class.
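As a rough illustration of the archive naming and the NULL convention, the following Python sketch builds an archive name and parses one dump row. It is not the FtpGoLoader implementation; both function names are hypothetical, and the sketch ignores the other escape sequences a MySQL dump may contain.

```python
from typing import List, Optional


def archive_name(year: int, month: int) -> str:
    """Build a name of the form go_yyyyMM-termdb-tables.tar.gz."""
    return f"go_{year:04d}{month:02d}-termdb-tables.tar.gz"


def parse_dump_line(line: str) -> List[Optional[str]]:
    """Split one tab-delimited row, mapping the NULL marker \\N to None."""
    fields = line.rstrip("\r\n").split("\t")
    return [None if field == r"\N" else field for field in fields]


print(archive_name(2012, 5))
# The sample row is made up; real term.txt rows have more columns.
print(parse_dump_line("42\tmitochondrion\t\\N\n"))
```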

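Note that GoGraphPath is relatively large (currently 1.9 million records) because it contains the transitive closure of the 3 GO ontology graphs. A transitive closure stores a row for every ancestor-descendant pair, so the table grows faster than the ontologies themselves, in the worst case quadratically in the number of terms.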
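To see why a transitive closure is so much larger than the graph it is built from, consider this small Python sketch. The four-term chain and the function name are invented for the example, and the sketch assumes that each term is also paired with itself; it does not reproduce how the GO project generates graph_path.

```python
from typing import Dict, List, Set, Tuple


def transitive_closure(parents: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Return every (ancestor, descendant) pair, pairing each term with itself.

    Every term must appear as a key in `parents`, even if it has no parents.
    """
    closure: Set[Tuple[str, str]] = set()
    for term in parents:
        stack = [term]
        seen: Set[str] = set()
        # Walk upward from the term, recording each ancestor once.
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((node, term))
            stack.extend(parents.get(node, []))
    return closure


# A chain of four terms: a is the root, d is the most specific term.
chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
pairs = transitive_closure(chain)
print(len(pairs))  # 3 direct edges become 10 closure rows
```

In this chain, three direct parent-child links expand to ten rows, and each additional level adds more rows than the one before it.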

Java 7 has known issues with FTP and the Windows firewall. Administrators must manually configure their firewall in order to use certain FTP commands. Not doing this will prevent LabKey from automatically loading GO annotations. To work around this problem, use the manual download option or configure your firewall as suggested in these links:

UniProtKB (SwissProt and TrEMBL)

Note that loading these files is functional and reasonably well tested, but due to the immense size of the files, it can take many hours or days to load them, even on high-performing systems. When funding becomes available, we plan to improve the performance of loading these files.

The main source for rich annotations is the EBI (the European Bioinformatics Institute) at:

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete

The two files of interest are:

uniprot_sprot.xml.gz, which contains annotations for the Swiss-Prot database. This database is smaller and richer, with far fewer entries but many more annotations per entry.
uniprot_trembl.xml.gz, which contains the annotations for the translated EMBL database (a DNA/RNA database). This database is more inclusive but has far fewer annotations per entry.

These are very large files. As of September 2007, the packed files are 360MB and 2.4GB respectively; unpacked, they are roughly six times larger than this. The files are released fairly often and grow in size on every release. See ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/README for details about the contents of these files.

To load these files:

Download the file of interest (uniprot_sprot.xml.gz or uniprot_trembl.xml.gz).
Unpack the file to a local drive on your LabKey web server.
Visit Admin Console -> Protein Databases.
Under Protein Annotations Loaded, click the Import Data button.
On the Load Protein Annotations page, type the full path to the annotation file.
Select the uniprot type.
Click the Load Annotations button.
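The unpacking step can be done with any gzip tool. If a scripted approach is preferred, a minimal Python sketch might look like the following. The function name is hypothetical, and the sketch assumes the source file name ends in .gz. Streaming the copy avoids holding the multi-gigabyte file in memory.

```python
import gzip
import shutil
from pathlib import Path


def unpack_gzip(src, dest_dir) -> Path:
    """Unpack src (a .gz file) into dest_dir and return the new path."""
    src = Path(src)
    # "uniprot_sprot.xml.gz" becomes "uniprot_sprot.xml".
    dest = Path(dest_dir) / src.with_suffix("").name
    with gzip.open(src, "rb") as packed, open(dest, "wb") as unpacked:
        # Copy in chunks rather than reading the whole file at once.
        shutil.copyfileobj(packed, unpacked)
    return dest
```

The returned path is the value to type into the Load Protein Annotations page.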

There is a sample XML file checked in to

.../sampledata/xarfiles/ms2pipe/annotations/Bovine_mini.uniprot.xml

This contains only the annotations associated with the Bovine_mini.fasta file.

The UniProt XML files are parsed and added to the database using the XMLProteinLoader.parseFile() method.
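Because the unpacked files are far too large to hold in memory, any parser for them has to stream. The Python sketch below shows one way to do that with the standard library. It is not the XMLProteinLoader implementation; the function name is hypothetical, and the inline sample is a heavily trimmed stand-in for a real entry, which carries many more elements.

```python
import io
import xml.etree.ElementTree as ET

# Namespace used by UniProt XML files.
NS = "{http://uniprot.org/uniprot}"

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot">
    <accession>P02769</accession>
    <name>ALBU_BOVIN</name>
    <organism>
      <name type="scientific">Bos taurus</name>
      <name type="common">Bovine</name>
    </organism>
  </entry>
</uniprot>
"""


def iter_entries(source):
    """Yield (accession, entry name, scientific organism name) per entry."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "entry":
            continue
        record = (
            elem.findtext(NS + "accession"),
            elem.findtext(NS + "name"),
            elem.findtext(f"{NS}organism/{NS}name[@type='scientific']"),
        )
        # Drop the entry's children so memory use stays flat.
        elem.clear()
        yield record


for record in iter_entries(io.BytesIO(SAMPLE)):
    print(record)
```

For a real file, pass the path of the unpacked XML file instead of the in-memory sample.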

FASTA

When LabKey loads results that were searched against a new FASTA file, it loads the FASTA file, including all sequences and any annotations that can be parsed from the FASTA header line. Every annotation is associated with an organism and a sequence. Guessing the organism can be problematic in a FASTA file. Several heuristics are in place and work fairly well, but not perfectly. Consider a FASTA file with a sequence definition line such as:

>xyzzy

You cannot infer the organism from it. Thus, the FastaDbLoader has two attributes: DefaultOrganism (a String like "Homo sapiens") and OrganismIsToBeGuessed (a boolean), accessible through the getters and setters setDefaultOrganism, getDefaultOrganism, setOrganismToBeGuessed, and isOrganismToBeGuessed. These two fields are exposed on the insertAnnots.post page.

Why is there a "Should Guess Organism?" option? If you know that your FASTA file comes from Human or Mouse samples, you can set the DefaultOrganism to "Homo sapiens" or "Mus musculus" and tell the system not to guess the organism. In this case, it uses the default for every sequence. This saves considerable loading time when you know your FASTA file came from a single organism.
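The interplay between the default organism and the guessing option can be sketched as follows. This Python example is a simplified illustration and not the FastaDbLoader logic: the function names, the single suffix-based heuristic, and the one-entry suffix map are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def choose_organism(header: str, default_organism: str,
                    should_guess: bool, suffix_map: Dict[str, str]) -> str:
    """Guess the organism from a name suffix, else fall back to the default."""
    if should_guess:
        words = header.split()
        identifier = words[0] if words else ""
        if "_" in identifier:
            # Look up the text after the last underscore in the suffix map.
            suffix = identifier.rpartition("_")[2]
            if suffix in suffix_map:
                return suffix_map[suffix]
    return default_organism


suffix_map = {"BOVIN": "Bos taurus"}  # stand-in for the organism suffix map
fasta = [">ALBU_BOVIN Serum albumin", "MKWVTF", "ISLLLL", ">xyzzy", "ACDE"]
for header, _sequence in read_fasta(fasta):
    print(header, "->", choose_organism(header, "Homo sapiens", True, suffix_map))
```

With guessing turned off, every record gets the default organism and no per-sequence lookup is performed.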

Important caveat: Do not assume that the organism used in the name of the FASTA file is correct. The Bovine_mini.fasta file, for example, sounds like it contains data from cows alone. In reality, it contains sequences from about 777 organisms.

LabKey Support