Importing dat file error: Too simple xml parsing

CPAS Forum (Inactive)
Importing dat file error: Too simple xml parsing aschwin.vanderwoude  2008-09-03 00:30
Status: Closed
 
Hi,

When trying to import a mascot dat file, I got the following error:

ERROR: XMLStreamException in hasNext()
com.ctc.wstx.exc.WstxParsingException: Unexpected close tag </search_hit>; expected </gamma>.
 at [row,col {unknown-source}]: [14607,18]
    at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:605)
    at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461)
    at com.ctc.wstx.sr.BasicStreamReader.reportWrongEndElem(BasicStreamReader.java:3256)
    at com.ctc.wstx.sr.BasicStreamReader.readEndElem(BasicStreamReader.java:3198)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2830)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
    at org.labkey.common.tools.XMLStreamReaderWrapper.next(XMLStreamReaderWrapper.java:246)
    at org.labkey.common.tools.SimpleXMLStreamReader.skipTo(SimpleXMLStreamReader.java:78)
    at org.labkey.common.tools.SimpleXMLStreamReader.skipToStart(SimpleXMLStreamReader.java:64)
    at org.labkey.common.tools.PepXmlLoader$FractionIterator.hasNext(PepXmlLoader.java:101)
    at org.labkey.ms2.PepXmlImporter.importRun(PepXmlImporter.java:88)
    at org.labkey.ms2.MS2Importer.upload(MS2Importer.java:181)
    at org.labkey.ms2.MS2Manager.importRun(MS2Manager.java:426)
    at org.labkey.ms2.pipeline.MS2ImportPipelineJob.run(MS2ImportPipelineJob.java:85)
    at org.labkey.ms2.pipeline.mascot.MascotImportPipelineJob.run(MascotImportPipelineJob.java:159)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
    at java.util.concurrent.FutureTask.run(FutureTask.java:123)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
    at java.lang.Thread.run(Thread.java:595)

This would indicate a problem with the XML file, so here is a snippet of the pepXML file:

    <spectrum_query spectrum="Wed_Sep__3_10-16-48_2008.spectrum1640.0000.0000.2" start_scan="0000" end_scan="0000" precursor_neutral_mass="910.4860" assumed_charge="2" index="1319">
    <search_result>
      <search_hit hit_rank="1" peptide="EPWAPSPQ" peptide_prev_aa="R" peptide_next_aa="-" protein="A0N5T0" num_tot_proteins="1" num_matched_ions="6" tot_num_ions="14" calc_neutral_pep_mass="910.4185" massdiff="+0.0675" num_tol_term="2" num_missed_cleavages="0" is_rejected="0" protein_descr="V<gamma>1 protein (Fragment) OS=Homo sapiens GN=V<gamma>1 PE=4 SV=1">
        <search_score name="ionscore" value="4.17"/>
        <search_score name="identityscore" value="33.17"/>
        <search_score name="star" value="1"/>
        <search_score name="homologyscore" value="17.02"/>
        <search_score name="expect" value="39.76"/>
      </search_hit>
    </search_result>
    </spectrum_query>

It seems the generated pepXML file is well-formed XML, but the pepXML parser in LabKey CPAS uses an overly naive parsing algorithm that isn't XML-compliant: the "<gamma>" tags sit inside quotation marks and thus are not part of the XML markup.
The stack trace also suggests that a custom parser is used instead of a proper XML parser, which supports my conclusion.

The job itself is flagged as COMPLETE, even though this is clearly a fatal error.

-Aschwin
 
 
aschwin.vanderwoude responded:  2008-09-03 01:13
Hmm, xmllint also complained about the XML file, so apparently the "<>" characters aren't allowed within attribute values. Or perhaps some parsers are stricter than others, although in my opinion an XML parser should allow such characters in places where they have no meaning.

Anyway, until a proper fix exists, the problem can be worked around by using the following wrapper script for Mascot2XML when running LabKey on Linux:

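#!/bin/sh
# Wrapper for Mascot2XML: run the real binary, then strip the unescaped
# <gamma>/<kappa> strings that break XML parsing of the generated pepXML file.
# Assumes the first argument is the .dat file, as in the original script.
Mascot2XML.bin "$@" || exit $?
FILE=`echo "$1" | sed -e 's/\.dat$/.xml/'`
OLDFILE="$FILE.old"
mv "$FILE" "$OLDFILE"
sed -r -e 's/<gamma>|<kappa>/ /g' "$OLDFILE" > "$FILE"
rm "$OLDFILE"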
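
A possible variation on the substitution step, sketched in Python: instead of replacing the tag-like strings with a space, escape them so the protein description keeps its text. This is only a sketch. The function name `escape_pseudo_tags` is made up for illustration, and the list of names (gamma, kappa) is taken from the sed expression above and may need extending for other databases.

```python
import re

# Tag-like strings seen in protein descriptions. The list is an assumption
# copied from the sed expression in the wrapper script.
PSEUDO_TAGS = re.compile(r"<(gamma|kappa)>")

def escape_pseudo_tags(line):
    """Rewrite <gamma>/<kappa> as &lt;gamma&gt;/&lt;kappa&gt;; leave everything else alone."""
    return PSEUDO_TAGS.sub(r"&lt;\1&gt;", line)

# Shortened stand-in for the attribute that caused the parse error.
sample = 'protein_descr="V<gamma>1 protein (Fragment)"'
print(escape_pseudo_tags(sample))
# protein_descr="V&lt;gamma&gt;1 protein (Fragment)"
```

Real markup such as `<search_hit>` is not matched by the pattern, so it passes through unchanged.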
 
aschwin.vanderwoude responded:  2008-09-03 01:59
Although the run completed, and the log doesn't present any sort of trouble, the data doesn't show up in the "MS2 run" list on the MS2 dashboard.

If I do "Process and Import data" and perform "Import peptides" on one of the dat files, I am presented with all the MS2 runs that imported dat files. Each of them seems to show the Mascot data, but I cannot compare between them, and ProteinProphet data is unavailable as well.

Is Mascot dat-file import non-functional in general at the moment?

-Aschwin
 
jeckels responded:  2008-09-05 09:35
Hi Aschwin,

I believe that the CPAS XML parser is correctly enforcing the XML spec by requiring that '<' be encoded in an attribute value, though it does seem like that might not be truly necessary.
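
For anyone who wants to check this against another standard parser, here is a minimal sketch using Python's built-in `xml.etree.ElementTree` (not the Woodstox parser that appears in the stack trace). The elements are shortened, made-up stand-ins for the failing `search_hit`, and the helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Raw '<' inside an attribute value, as in the generated pepXML file.
raw = '<search_hit protein_descr="V<gamma>1 protein (Fragment)"></search_hit>'

# Same element with '<' written as &lt; and '>' as &gt;
escaped = '<search_hit protein_descr="V&lt;gamma&gt;1 protein (Fragment)"></search_hit>'

# A bare '>' on its own is permitted in an attribute value.
only_gt = '<search_hit protein_descr="gamma>1"></search_hit>'

print(is_well_formed(raw))      # False
print(is_well_formed(escaped))  # True
print(is_well_formed(only_gt))  # True
```

After parsing the escaped form, the attribute value reads back as the original text, `V<gamma>1 protein (Fragment)`, so nothing is lost by encoding it.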

If you're directly importing MS2 search results, you will unfortunately need to go to your folder's main portal page and add the 'MS2 Runs' web part. The default list of MS2 runs is actually the 'MS2 Runs (Enhanced)' which adds functionality but only shows runs that include experimental metadata, which the CPAS pipeline automatically creates. I know this is confusing and it's been a known issue for some time but it hasn't bubbled to the top of any of our clients' priority lists.

Thanks,
Josh