Custom annotation

CPAS Forum (Inactive)
Custom annotation wnels2  2007-12-04 11:38
Status: Closed
 
I'd like to add custom annotation to a project.
I do not understand how to name the columns of the TSV file. I'm thinking that the first column name must link to a column that is already defined in CPAS's database?

There is an IPI example:
IPINumber    RandomValue    Description
IPI00003986    5    T-cell receptor beta chain V region YT35 precursor

that links to:
>IPI:IPI00003986.1|SWISS-PROT:P01733 Tax_Id=9606 Gene_Symbol=- T-cell receptor beta chain V region YT35 precursor

And a demo example:
GeneName    Peroxisomal
PEX6    TRUE
that links to:
>UPSP:PEX6_YEAST P33760 saccharomyces cerevisiae (baker's yeast). Tax_Id=4932 peroxisomal biogenesis factor 6 (peroxin-6) (peroxisome biosynthesis protein pas8). 2/2007

If I want to link:
UinProtAccNum    UPS1
P00709    TRUE

to the swiss-prot fasta header:
>P00709|LALBA_HUMAN Alpha-lactalbumin precursor - Homo sapiens (Human) [MASS=16225]

What is the correct column name for UinProtAccNum? How do I figure out my choices?

Thanks a lot,
Bill
 
 
jeckels responded:  2007-12-04 15:28
Bill,

The name of the first column doesn't matter. The names of the other columns (if there are any other columns) will be the name of the properties you can add when you customize your views. You tell CPAS what kind of protein name you're using with the drop-down.
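
To make the file layout concrete, here is a minimal sketch (not CPAS code) that writes a custom annotation TSV in the shape described above: one lookup-key column whose name is free-form, followed by property columns. The function name `write_annotation_tsv` and its parameters are hypothetical, chosen only for illustration; the sample row is the PEX6 demo from the original question.

```python
import csv
import io


def write_annotation_tsv(rows, property_names, key_column="ProteinKey"):
    """Return TSV text: first column is the lookup key, the rest are properties.

    key_column is arbitrary (per the reply above, its name does not matter);
    the property column names are what would show up when customizing views.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow([key_column] + list(property_names))
    for key, values in rows:
        writer.writerow([key] + [values[name] for name in property_names])
    return buf.getvalue()


tsv_text = write_annotation_tsv(
    rows=[("PEX6", {"Peroxisomal": "TRUE"})],
    property_names=["Peroxisomal"],
    key_column="GeneName",
)
print(tsv_text)
```

The kind of key in the first column (IPI, GeneName, SwissProt) is still chosen in the drop-down at upload time; nothing in the file itself declares it.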

Version 2.3 only supports IPI, GeneName, and SwissProt for protein names. Version 2.4 will add SwissProtAccn, which should address your need. Please let me know if there are other types of protein names you'd like to see supported in the future.

Thanks,
Josh
 
Peter responded:  2007-12-04 17:05
Also, if you are using 2.2 or earlier, many Swiss-Prot FASTA headers do not get parsed correctly and so are not recognized as Swiss-Prot. This is true of current FASTA files that you download from ExPASy: they look like the LALBA_HUMAN example you gave, and the FASTA loader wasn't recognizing P00709|LALBA_HUMAN as a SwissProtAccession|SwissProtName. Without recognizing that, CPAS can't tie that sequence to any other identifiers or annotations, custom or otherwise.

This is fixed in 2.3.
 
wnels2 responded:  2007-12-06 09:34
Thanks for the feedback but I think I'm still missing something.
I have been able to get the custom annotation to work with the IPI databases, but my lab is resistant to using IPI because the linking accession numbers obscure the protein description on the CPAS reports and they don't want to see all of the isoforms of a protein.
I have not been able to get custom annotation to work with Swiss-Prot. Are you saying that it does not work at all with a Swiss-Prot database, or just with specific lookup strings? I have tried P00709|LALBA_HUMAN, P00709, LALBA, and LALBA_HUMAN without success.

Is there a table that has the correct lookup string (for example I noticed there was a lookupstring field in prot.fastasequences)?
Can I reformat the FASTA headers of the Swiss-Prot database as a workaround (add UPSP: to the front of each header)?


Also, another question regarding how the FASTA headers are parsed:
I would like to implement a decoy database. Normally we append a reverse copy of the database to itself with a REV prefix on the reverse sequence's header (for example, >REVP00709|LALBA_HUMAN Alpha-lactalbumin precursor - Homo sapiens (Human) [MASS=16225]). Is this the best approach for CPAS?
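
For reference, the decoy construction described in the question can be sketched roughly as follows. This is only an illustration of the "append a reversed copy with a REV prefix" approach; the function name `add_reversed_decoys` is hypothetical, and the sketch writes each sequence on a single line rather than preserving the original line wrapping. Whether CPAS parses REV-prefixed headers well is a separate question that this code does not answer.

```python
def add_reversed_decoys(fasta_text, prefix="REV"):
    """Return the FASTA text followed by a reversed copy of every entry.

    Each decoy header is the original header with `prefix` prepended,
    and each decoy sequence is the original sequence reversed.
    """
    entries = []
    header, seq_lines = None, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(seq_lines)))
            header, seq_lines = line[1:], []
        elif line.strip():
            seq_lines.append(line.strip())
    if header is not None:
        entries.append((header, "".join(seq_lines)))

    if not entries:
        return ""

    out = [f">{h}\n{s}" for h, s in entries]               # forward entries
    out += [f">{prefix}{h}\n{s[::-1]}" for h, s in entries]  # reversed decoys
    return "\n".join(out) + "\n"


example = ">P00709|LALBA_HUMAN Alpha-lactalbumin precursor\nMRFF\nVPLF\n"
print(add_reversed_decoys(example))
```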

Thanks again for your help.

Bill
 
Peter responded:  2007-12-06 17:16
The problem is that the fasta parser in 2.2 didn't recognize "untyped" sprot identifiers in the header, so it was never associating an sprot protein name with the sequence. By untyped I mean there is no leading character sequence that identifies LALBA_HUMAN as a swissprot id. I fixed this for v2.3 but if you have to load on 2.2 you can reformat your header lines to look like

>SPROT_NAME|LALBA_HUMAN

and the 2.2 fasta loader should correctly recognize the "SPROT_NAME" as the code. It will add an identifier of type SwissProt value LALBA_HUMAN. You can then match it up to a custom annotation keyed by swissprot name. (Note you can't match by Sprot Accession, only by name).
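
A minimal sketch of that header reformatting, for anyone who needs to prepare a file for 2.2. It assumes headers of the form `>ACCESSION|NAME_SPECIES description`, as in the LALBA_HUMAN example; the regular expression and the function name `retag_header` are illustrative assumptions, not the loader's own logic. Keeping the description text after the name is also an assumption, since the example above shows only the identifier.

```python
import re

# Untyped Swiss-Prot header, e.g. ">P00709|LALBA_HUMAN Alpha-lactalbumin ..."
UNTYPED_SPROT = re.compile(
    r"^>(?P<accession>[A-Z0-9]+)\|(?P<name>[A-Z0-9]+_[A-Z0-9]+)(?P<rest>.*)$"
)


def retag_header(line):
    """Rewrite an untyped Swiss-Prot header to the SPROT_NAME|<name> form.

    Lines that do not match (sequence lines, IPI headers, headers that
    are already tagged) are returned unchanged.
    """
    m = UNTYPED_SPROT.match(line)
    if m is None:
        return line
    return f">SPROT_NAME|{m.group('name')}{m.group('rest')}"


print(retag_header(
    ">P00709|LALBA_HUMAN Alpha-lactalbumin precursor - Homo sapiens (Human) [MASS=16225]"
))
```

Note that the accession is dropped from the header in this form, which fits the point above that matching is by name only.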

If you want to distinguish dummy sequences you might try naming them as if they were swissprot names, e.g.

>SPROT_NAME|LALBA_HUREV
 
wnels2 responded:  2008-01-18 09:49
Greetings,
I'm trying to add custom annotations to SwissProt again now that I have 2.3 installed. Can you help me clarify a couple of points?

1. Peter had mentioned that the SwissProt headers were being parsed incorrectly on 2.2. Do I need to delete and reload the swissProt databases?

2. Josh said that 2.3 will now accept GeneName and SwissProt in addition to IPI.

IPI was working before so I know it joins to an IPI accession number.

What exactly do GeneName and SwissProt join to? Does GeneName work on any database? Is the format of the gene name UBE2C_HUMAN or just UBE2C?

Does SwissProt join to O00762|UBE2C_HUMAN?

Thanks,
Bill
 
wnels2 responded:  2008-01-24 09:37
I have made a little progress but I'm stuck now.
* IPI will link to the IPI accession number (IPI00217493.5)
* Swiss Prot will link to the thing that comes after the swiss prot accession number (MYG_HUMAN)
* Gene Name will link to the HUGO Gene Nomenclature Committee version of the gene name. I was assuming it was MYG but in this case it is MB.
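
As a rough illustration of the three key types listed above, the sketch below pulls candidate lookup keys straight out of a FASTA header. It is an assumption-laden helper (the name `annotation_keys` and its patterns are hypothetical, not how CPAS parses headers), and it drops the IPI version suffix because a later reply in this thread says custom annotation files should omit it. A `Gene_Symbol=-` value is treated as "no gene name".

```python
import re


def annotation_keys(header):
    """Return the lookup keys that can be read directly from one FASTA header."""
    keys = {}
    parts = header.lstrip(">").split(None, 1)
    token = parts[0] if parts else ""

    # IPI-style header: ">IPI:IPI00012069.1|SWISS-PROT:P15559|..."
    ipi = re.search(r"IPI:(IPI\d+)(?:\.\d+)?", token)
    if ipi:
        keys["IPI"] = ipi.group(1)

    # Swiss-Prot-style header: ">P15559|NQO1_HUMAN ..."
    sprot = re.match(r"[A-Z0-9]+\|([A-Z0-9]+_[A-Z0-9]+)$", token)
    if sprot:
        keys["SwissProt"] = sprot.group(1)

    # Gene symbol, present in IPI headers as "Gene_Symbol=NQO1"
    gene = re.search(r"Gene_Symbol=(\S+)", header)
    if gene and gene.group(1) != "-":
        keys["GeneName"] = gene.group(1)
    return keys


print(annotation_keys(">P15559|NQO1_HUMAN NAD(P)H dehydrogenase [quinone] 1 - Homo sapiens (Human)"))
print(annotation_keys(">IPI:IPI00012069.1|SWISS-PROT:P15559|TREMBL:Q53G81 Tax_Id=9606 Gene_Symbol=NQO1 NAD"))
```

A Swiss-Prot header on its own carries no gene symbol, so a GeneName key for those sequences has to come from cross-references loaded separately rather than from the header.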

Also, I have found that the databases are cross-referenced, so I can use IPI to link to a SwissProt database and SwissProt to link to an IPI database; very nice. I'm really excited about showing this new feature to the lab, if I could just understand a couple more details.

1. The parsing of the databases seems to have gone awry. I have clicked the Update SeqIds button for most of the databases.
  a. Most of the proteins in my databases (Human SwissProt and IPI) have empty First IPI, First Gene Name, and First Swiss Prot (First Swiss Prot sometimes will contain SPROT_NAME) columns.
  b. The Best Name column will have one of three formats;
P16083|NQO2_HUMAN (only found in Swiss Prot database searches),
GSTA1_HUMAN , or
IPI:IPI00655754.1|SWISS-PROT:Q92890-1 (Mostly found in IPI databases but showed up once in a Swiss prot search)

2. I have loaded GO and uniprot_sprot.xml. GO is working if the protein was well parsed, but I'm not sure how to tell whether uniprot_sprot.xml is OK (I don't know how to access its extra annotation).

These are really great new features and I would like to show them off.
How can I fix the parsing and display the uniprot annotation?
Any suggestions are greatly appreciated.

Thanks,
Bill
 
wnels2 responded:  2008-01-25 05:46
Greetings:
I will send a bottle of authentic hand crafted Kentucky bourbon to anyone who can help me resolve this thread.

http://www.kybourbon.com/english/pages/BourbonTrail.pdf.

Cheers,
Bill
 
Peter responded:  2008-01-30 18:18
Bill,

see my explanation of the mechanisms for fasta and annotation file loading that I posted at https://www.labkey.org/announcements/home/CPAS/support/thread.view?rowId=2118

I also sent you some sql scripts to try out for fixing your problems. They are prototypes of what I'd like to put in the product itself so your testing and feedback will be very helpful.
 
wnels2 responded:  2008-03-11 11:04
Hi,
I'm still working on this. It is looking better, but there are three issues that I don't understand in the compare output. Below are examples of the issues. The report is attached.

I have deleted my entire database and reloaded everything from scratch.
I am using the Windows LabKey installer, version 2.3.
 

1) I have loaded three custom annotation files (gene name, swiss prot, and IPI). IPI is not working at all. I have attached the annotation file.

2) The Gene Name SERPINC1 does not link to IPI. See issue 2 below for fasta entries.

3) Locus NQO1_HUMAN does not link to gene name or IPI. See issue 3 below for fasta entries. Attached is the uniprot_sprot.xml entry that is also loaded.

Thanks,
Bill


Issue #2
=======================================================================================================================
>P01008|ANT3_HUMAN Antithrombin-III precursor - Homo sapiens (Human)
MYSNVIGTVTSGKRKVYLLSLLLIGFWDCVTCHGSPVDICTAKPRDIPMNPMCIYRSPEKKATEDEGSEQKIPEATNRRV
WELSKANSRFATTFYQHLADSKNDNDNIFLSPLSISTAFAMTKLGACNDTLQQLMEVFKFDTISEKTSDQIHFFFAKLNC
RLYRKANKSSKLVSANRLFGDKSLTFNETYQDISELVYGAKLQPLDFKENAEQSRAAINKWVSNKTEGRITDVIPSEAIN
ELTVLVLVNTIYFKGLWKSKFSPENTRKELFYKADGESCSASMMYQEGKFRYRRVAEGTQVLELPFKGDDITMVLILPKP
EKSLAKVEKELTPEVLQEWLDELEEMMLVVHMPRFRIEDGFSLKEQLQDMGLVDLFSPEKSKLPGIVAEGRDDLYVSDAF
HKAFLEVNEEGSEAAASTAVVIAGRSLNPNRVTFKANRPFLVFIREVPLNTIIFMGRVANPCVK

>IPI:IPI00032179.2|SWISS-PROT:P01008|TREMBL:Q5TC78;Q7KZ97;Q9UE54|ENSEMBL:ENSP00000356671|REFSEQ:NP_000479|H-INV:HIT000070696|VEGA:OTTHUMP00000035393 Tax_Id=9606 Gene_Symbol=SERPINC1 Antithrombin III variant
MYSNVIGTVTSGKRKVYLLSLLLIGFWDCVTCHGSPVDICTAKPRDIPMNPMCIYRSPEK
KATEDEGSEQKIPEATNNRRVWELSKANSRFATTFYQHLADSKNDNDNIFLSPLSISTAF
AMTKLGACNDTLQQLMEVFKFDTISEKTSDQIHFFFAKLNCRLYRKANKSSKLVSANRLF
GDKSLTFNETYQDISELVYGAKLQPLDFKENAEQSRAAINKWVSNKTEGRITDVIPSEAI
NELTVLVLVNTIYFKGLWKSKFSPENTRKELFYKADGESCSASMMYQEGKFRYRRVAEGT
QVLELPFKGDDITMVLILPKPEKSLAKVEKELTPEVLQEWLDELEEMMLVVHMPRFRIED
GFSLKEQLQDMGLVDLFSPEKSKLPGIVAEGRDDLYVSDAFHKAFLEVNEEGSEAAASTA
VVIAGRSLNPNRVTFKANMPFLVFIREVPLNTIIFMGRVANPCVK

>IPI:IPI00844156.2|TREMBL:Q7KYQ5;Q7KYY4;Q8IZZ8;Q8IZZ9;Q8J000;Q8J001;Q8TCE1;Q9UBW9|ENSEMBL:ENSP00000307953 Tax_Id=9606 Gene_Symbol=SERPINC1 SERPINC1 protein
MYSNVIGTVTSGKRKVYLLSLLLIGFWDCVTCHGSPVDICTAKPRDIPMNPMCIYRSPEK
KATEDEGSEQKIPEATNRRVWELSKANSRFATTFYQHLADSKNDNDNIFLSPLSISTAFA
MTKLGACNDTLQQLMEVFKFDTISEKTSDQIHFFFAKLNCRLYRKANKSSKLVSANRLFG
DKSLTFNDLYVSDAFHKAFLEVNEEGSEAAASTAVVIAGRSLNPNRVTFKANRPFLVFIR
EVPLNTIIFMGRVANPCVK

Issue #3
=======================================================================================================================

>P15559|NQO1_HUMAN NAD(P)H dehydrogenase [quinone] 1 - Homo sapiens (Human)
MVGRRALIVLAHSERTSFNYAMKEAAAAALKKKGWEVVESDLYAMNFNPIISRKDITGKLKDPANFQYPAESVLAYKEGH
LSPDIVAEQKKLEAADLVIFQFPLQWFGVPAILKGWFERVFIGEFAYTYAAMYDKGPFRSKKAVLSITTGGSGSMYSLQG
IHGDMNVILWPIQSGILHFCGFQVLEPQLTYSIGHTPADARIQILEGWKKRLENIWDETPLYFAPSSLFDLNFQAGFLMK
KEVQDEEKNKKFGLSVGHHLGKSIPTDNQIKARK

>IPI:IPI00012069.1|SWISS-PROT:P15559|TREMBL:Q53G81|ENSEMBL:ENSP00000319788|REFSEQ:NP_000894|H-INV:HIT000191221|VEGA:OTTHUMP00000081321;OTTHUMP00000174897 Tax_Id=9606 Gene_Symbol=NQO1 NAD
MVGRRALIVLAHSERTSFNYAMKEAAAAALKKKGWEVVESDLYAMNFNPIISRKDITGKL
KDPANFQYPAESVLAYKEGHLSPDIVAEQKKLEAADLVIFQFPLQWFGVPAILKGWFERV
FIGEFAYTYAAMYDKGPFRSKKAVLSITTGGSGSMYSLQGIHGDMNVILWPIQSGILHFC
GFQVLEPQLTYSIGHTPADARIQILEGWKKRLENIWDETPLYFAPSSLFDLNFQAGFLMK
KEVQDEEKNKKFGLSVGHHLGKSIPTDNQIKARK

Different Sequence, same gene name.
>IPI:IPI00619898.2|TREMBL:Q3B792|ENSEMBL:ENSP00000368334|REFSEQ:NP_001020605 Tax_Id=9606 Gene_Symbol=NQO1 NQO1 protein (Fragment)
MVGRRALIVLAHSERTSFNYAMKEAAAAALKKKGWEVVESDLYAMNFNPIISRKDITGKL
KDPANFQYPAESVLAYKEGHLSPDIVAEQKKLEAADLVIFQSKKAVLSITTGGSGSMYSL
QGIHGDMNVILWPIQSGILHFCGFQVLEPQLTYSIGHTPADARIQILEGWKKRLENIWDE
TPLYFAPSSLFDLNFQAGFLMKKEVQDEEKNKKFGLSVGHHLGKSIPTDNQKKKKKKKKK


======================================================================================
 
jeckels responded:  2008-03-11 15:52
Bill,

Thanks for plugging away.

1: When doing a custom annotation list based on IPI, don't include the version number of the protein. There's a small note in the instructions that indicates that's the expected format, but we don't do a good job of telling you that you've uploaded an unsupported format. I've opened an issue, https://www.labkey.org/issues/home/Developer/issues/details.view?issueId=5380, to try to make this better in the future. Please let me know if the version information is an important matching criterion for you. In talking with some users it sounded like it was not needed for their cases, but that may not be representative.
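
If an existing annotation file already has versioned IPI numbers, something like the following sketch could strip them before upload. It assumes a tab-separated file with a header row and the IPI accession in the first column; the function name `strip_ipi_versions` is hypothetical and this is not part of CPAS.

```python
import re

# "IPI00003986.1" -> "IPI00003986"; keys without a version are left alone
IPI_VERSION = re.compile(r"^(IPI\d+)\.\d+$")


def strip_ipi_versions(tsv_text):
    """Remove the ".N" version suffix from IPI accessions in the first column."""
    lines = tsv_text.splitlines()
    if not lines:
        return ""
    out = lines[:1]  # keep the header row untouched
    for line in lines[1:]:
        key, sep, rest = line.partition("\t")
        key = IPI_VERSION.sub(r"\1", key.strip())
        out.append(key + sep + rest)
    return "\n".join(out) + "\n"


print(strip_ipi_versions("IPINumber\tRandomValue\nIPI00003986.1\t5\n"))
```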

2 and 3: It looks like we don't parse these FASTA headers correctly in 2.3. 8.1 already contains some improvements to handle them, and I've made additional edits to extract even more information. With my changes, all of the custom annotation information will join up correctly. I've added some of these headers to our test suite to make sure we keep parsing them correctly.

So, you should be able to make some more progress with 2.3, and 8.1 should fully address these issues.

Thanks,
Josh
 
wnels2 responded:  2008-03-12 08:01
Thanks for your help.

I thought the version number was the isoform, so you're right: it shouldn't matter whether it is included. It is working fine now.
-Bill