2

I am looking for a solution to an XML parsing problem in Python. Although spectrum is not the root element of the real file, let's assume it is for this example.

<spectrum index="2" id="controller=0 scan=3" defaultArrayLength="485">
          <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
          <cvParam cvRef="MS" accession="MS:1000580" name="MSn spectrum" value=""/>
          <cvParam cvRef="MS" accession="MS:1000127" name="centroid mass spectrum" value=""/>
          <precursorList count="1">
            <precursor spectrumRef="controller=0 scan=2">
              <isolationWindow>
                <cvParam cvRef="MS" accession="MS:1000040" name="m/z" value="810.78999999999996"/>
                <cvParam cvRef="MS" accession="MS:1000023" name="isolation width" value="2"/>
              </isolationWindow>
              <selectedIonList count="1">
                <selectedIon>
                  <cvParam cvRef="MS" accession="MS:1000040" name="m/z" value="810.78999999999996"/>
                </selectedIon>
              </selectedIonList>
              <activation>
                <cvParam cvRef="MS" accession="MS:1000133" name="collision-induced dissociation" value=""/>
                <cvParam cvRef="MS" accession="MS:1000045" name="collision energy" value="35"/>
              </activation>
            </precursor>
          </precursorList>
          <binaryDataArrayList count="2">
            <binaryDataArray encodedLength="5176">
              <cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" value=""/>
              <cvParam cvRef="MS" accession="MS:1000576" name="no compression" value=""/>
              <cvParam cvRef="MS" accession="MS:1000514" name="m/z array" value="" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z"/>
              <binary>AAAAYHHsbEAAAADg3yptQAAAAECt7G1AAAAAAN8JbkAAAAAA.......hLJ==</binary>
            </binaryDataArray>
            <binaryDataArray encodedLength="2588">
              <cvParam cvRef="MS" accession="MS:1000521" name="32-bit float" value=""/>
              <cvParam cvRef="MS" accession="MS:1000576" name="no compression" value=""/>
              <cvParam cvRef="MS" accession="MS:1000515" name="intensity array" value=""/>
              <binary>ZFzUQWmVo0FH/o9BRfUyQg+xjUOzkZdC5k66QWk6HUSpqyZCsV1NQ......uH=</binary>
            </binaryDataArray>
          </binaryDataArrayList>
</spectrum>

What I am trying to achieve is to find every selectedIon element in the tree and trace back to its ancestor spectrum element. If a selectedIon element is found, I want to print the following:

SelectedIon information:


Mass: 810.78999999999996

Spectra Info:
-------------
index=2
id=controller=0
scan=3
length=485

General Info
------------
ms level=2
MSn spectrum=-
centroid mass spectrum=-
.....................
And all the other cvParam names and values, in the same format as above.

Binary
------
m/z array = AAAAYHHsbEAAAADg3yptQAAAAECt7G1AAAA.....== 

intensity array = ZFzUQWmVo0FH/o9BRfUyQg+xjUOzkZdC5k66Q....5C77=

What I have tried so far:

import xml.etree.ElementTree as ET
tree=ET.parse('file.mzml')
NS="{http://psi.hupo.org/ms/mzml}"
filesource=tree.findall('.//'+NS+'selectedIon') # returns a list of every selectedIon element in the tree

Now, how can I trace back from each selectedIon to its spectrum element and that element's subelements, so that I can extract the information shown above?

thchand
  • Why don't you go the other way? i.e. Loop over spectrum elements and output if it has a selectedIon element. – Avaris Sep 25 '11 at 17:06
  • I am trying to parse only the spectrum elements that have a selectedIon. Going the other way will load all spectrum elements, including those that may not have a selectedIon. – thchand Sep 25 '11 at 17:19
  • Sure, but if that's the case you can skip that spectrum element and go to the next one. – Avaris Sep 25 '11 at 17:30
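
The approach suggested in these comments (loop over the spectrum elements and skip those without a selectedIon) could be sketched with the standard library alone. This is only a minimal sketch: the helper names `spectra_with_selected_ion` and `describe` and the inline `SAMPLE` document are made up for illustration, and only the namespace and element names are taken from the question.

```python
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"

# Hypothetical, heavily trimmed stand-in for a real mzML file.
SAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <spectrum index="1" id="controller=0 scan=2" defaultArrayLength="3">
    <cvParam name="ms level" value="1"/>
  </spectrum>
  <spectrum index="2" id="controller=0 scan=3" defaultArrayLength="485">
    <cvParam name="ms level" value="2"/>
    <precursorList count="1"><precursor><selectedIonList count="1"><selectedIon>
      <cvParam name="m/z" value="810.79"/>
    </selectedIon></selectedIonList></precursor></precursorList>
    <binaryDataArrayList count="1"><binaryDataArray>
      <cvParam name="m/z array" value=""/>
      <binary>AAAA</binary>
    </binaryDataArray></binaryDataArrayList>
  </spectrum>
</mzML>"""


def spectra_with_selected_ion(root):
    """Yield (spectrum, ions) for each spectrum that contains a selectedIon."""
    for spectrum in root.iter(NS + "spectrum"):
        ions = list(spectrum.iter(NS + "selectedIon"))
        if ions:  # spectra without a selectedIon are skipped
            yield spectrum, ions


def describe(spectrum, ions):
    """Collect the fields asked for in the question into a plain dict."""
    info = {
        "attributes": dict(spectrum.attrib),
        # findall() without './/' only looks at direct children of <spectrum>
        "general": {p.get("name"): p.get("value")
                    for p in spectrum.findall(NS + "cvParam")},
        "masses": [p.get("value")
                   for ion in ions
                   for p in ion.findall(NS + "cvParam")
                   if p.get("name") == "m/z"],
        "binary": {},
    }
    for array in spectrum.iter(NS + "binaryDataArray"):
        names = [p.get("name") for p in array.findall(NS + "cvParam")]
        # Use the cvParam whose name ends in " array" as the label
        kind = next((n for n in names if n.endswith(" array")), None)
        info["binary"][kind] = array.findtext(NS + "binary")
    return info


root = ET.fromstring(SAMPLE)
results = [describe(s, ions) for s, ions in spectra_with_selected_ion(root)]
for info in results:
    print(info)
```

For a real file, `ET.parse('file.mzml').getroot()` would take the place of `ET.fromstring(SAMPLE)`. Starting from the spectrum avoids the need for parent pointers, which `xml.etree.ElementTree` elements do not keep.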

2 Answers

1

XPath will let you access an ancestor: "ancestor::spectrum" will return the <spectrum> element you are contained within. If you use lxml, you can use full XPath syntax to find elements you want.

from lxml import etree
tree = etree.XML('file.mzml')
NS = "{http://psi.hupo.org/ms/mzml}"
filesource = tree.findall('.//'+NS+'selectedIon')
spectrum = filesource.xpath('ancestor::spectrum')[0]

(I think, not tested...)

UPDATED: code that actually works:

from lxml import etree

tree = etree.parse('foo.xml')
for el in tree.findall(".//selectedIon"):
    for top in el.xpath("ancestor::spectrum"):
        print(top)
Ned Batchelder
  • Uhmm, xpath is one based, `xpath('ancestor::spectrum')[1]`. You could also select all spectrum that has selectedIon children directly: `//spectrum[.//selectedIon]` – forty-two Sep 25 '11 at 17:30
  • I think filesource = tree.findall('.//'+NS+'selectedIon') creates a list, and a list has no xpath attribute. – thchand Sep 25 '11 at 17:34
  • @forty-two: The //spectrum[.//selectedIon] expression will select all selectedIon. But the same question remains: how do I parse the information of the spectrum and its other child elements? – thchand Sep 25 '11 at 17:37
  • @forty-two: XPath is 1-based, but the list returned by it is a Python list, so you need to index it at [0] to get the first (and only) element from it. – Ned Batchelder Sep 25 '11 at 21:10
  • @thchand: I've added a code snippet that works. You'll need to deal with the namespaces yourself. – Ned Batchelder Sep 25 '11 at 21:21
  • But how can I deal with the namespace in el.xpath("ancestor::spectrum")? – thchand Oct 04 '11 at 15:10
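
On the namespace question in the last comment: with lxml, one common way is to pass a prefix-to-URI mapping through the `namespaces` argument of `xpath()` and use that prefix in the expression. The sketch below assumes the namespace URI from the question; the prefix `m`, the `NSMAP` name, and the inline `SAMPLE` document are arbitrary choices made for illustration.

```python
from lxml import etree

# The prefix "m" is arbitrary; only the URI has to match the document.
NSMAP = {"m": "http://psi.hupo.org/ms/mzml"}

# Hypothetical, heavily trimmed stand-in for a real mzML file.
SAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <spectrum index="1" id="controller=0 scan=2"/>
  <spectrum index="2" id="controller=0 scan=3">
    <precursorList><precursor><selectedIonList><selectedIon>
      <cvParam name="m/z" value="810.79"/>
    </selectedIon></selectedIonList></precursor></precursorList>
  </spectrum>
</mzML>"""

root = etree.fromstring(SAMPLE)

found = []
for ion in root.xpath(".//m:selectedIon", namespaces=NSMAP):
    # xpath() returns a Python list, hence the 0-based index
    spectrum = ion.xpath("ancestor::m:spectrum", namespaces=NSMAP)[0]
    mass = ion.xpath("m:cvParam[@name='m/z']/@value", namespaces=NSMAP)
    found.append((spectrum.get("id"), mass))
    print(spectrum.get("id"), mass)

# Alternative from the comments: select matching spectra directly
spectra = root.xpath("//m:spectrum[.//m:selectedIon]", namespaces=NSMAP)
```

For a file on disk, `etree.parse('file.mzml')` returns a tree whose `xpath()` method accepts the same `namespaces` argument.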
0

If this is still a current issue, you might try pymzML, a Python interface to mzML files.

Printing all information from all MS2 spectra is as easy as:

import pymzml
msrun = pymzml.run.Reader("your-file.mzML")
for spectrum in msrun:
    if spectrum['ms level'] == 2:
        # spectrum is a dict, so you can just print it        
        print(spectrum)

(Disclosure: I'm one of the authors)

user1251007