
I am trying to parse an XML file with Python 2.7.

Here is the XML file I am using:

<NS:Member>
<NS:Area fid='120410'>
<NS:Code>10021</NS:Code>
<NS:version>4</NS:version>
<NS:versionDate>2004-03-29</NS:versionDate>
<NS:theme>Buildings</NS:theme>
<NS:Value>42.826432</NS:Value>
<NS:changeHistory>
    <NS:changeDate>2002-09-26</NS:changeDate>
    <NS:reasonForChange>New</NS:reasonForChange>
</NS:changeHistory>
<NS:changeHistory>
    <NS:changeDate>2003-10-24</NS:changeDate>
    <NS:reasonForChange>Attributes</NS:reasonForChange>
</NS:changeHistory>
<NS:changeHistory>
    <NS:changeDate>2004-03-18</NS:changeDate>
    <NS:reasonForChange>Attributes</NS:reasonForChange>
</NS:changeHistory>
<NS:Group>Building</NS:Group>
<NS:make>Manmade</NS:make>
<NS:Level>50</NS:Level>
<NS:polygon>
    <NS2:Polygon srsName='NS2:BNG'>
    <NS2:Boundary>
        <NS2:LinearRing>
            <NS2:coordinates>383415.110,400491.900 383411.090,400485.570 383415.500,400482.770 383420.430,400490.530 383418.780,400491.580 383417.930,400490.240 383415.160,400491.980 383415.110,400491.900 
            </NS2:coordinates>
        </NS2:LinearRing>
    </NS2:Boundary>
    </NS2:Polygon>
</NS:polygon></NS:Area>
</NS:Member>

I am only interested in the ID, Group, make and coordinates parts of the XML file.

And the code I use is:

import xml.sax

class MyHandler(xml.sax.ContentHandler):
    
    def __init__(self):
        self.__CurrentData = ""
        self.__ID = ""
        self.__Group = ""
        self.__make = ""
        self.__coordinates = []
        self.__coordString = ""
        
        
    def startElement(self, tag, attributes):
        self.__CurrentData = tag
        if tag == "NS:Area":
            self.__ID = attributes["fid"]
            print "ID: ", self.__ID
                           
            
    def endElement(self, tag):
        if self.__CurrentData == "NS:Group":
            print "Group: ", self.__Group
            
        elif self.__CurrentData == "NS:make":
            print "Make: ", self.__make
                                
        elif self.__CurrentData == "NS2:coordinates":
            print "coordinates: ", self.__coordString
                                
        self.__CurrentData = ""
        
            
    def characters(self, content):
        if self.__CurrentData == "NS:Area":
            self.__ID = content
        elif self.__CurrentData == "NS:Group":
            self.__Group = content
        elif self.__CurrentData == "NS:make":
            self.__make = content
        elif self.__CurrentData == "NS2:coordinates":
            self.__coordString = content

I expected to see the following output:

ID: 120410

Group: Building

Make: Manmade

coordinates: 383415.110,400491.900 383411.090,400485.570 383415.500,400482.770 383420.430,400490.530 383418.780,400491.580 383417.930,400490.240 383415.160,400491.980 383415.110,400491.900

However, what I've got is:

ID: 120410

Group: Building

Make: Manmade

coordinates:

where the coordinates are missing and replaced by a lot of spaces.

May I know what is wrong with my code?

Many thanks.

ChangeMyName

2 Answers


Your method cannot properly read the content of child tags, which is where the coordinates are found. I would recommend a DOM-type parser (I like lxml personally) instead of the one you are using, as it will greatly simplify this task thanks to its tracking of relationships between tag elements. That said, I can elaborate on what you would have to implement to handle this in your current parser.
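For comparison, here is a minimal sketch of the tree-based route. It uses the standard library's xml.etree.ElementTree rather than lxml, and it rests on assumptions: the namespace URIs are hypothetical placeholders (the question's snippet does not show the real xmlns declarations, and ElementTree needs the prefixes to be declared), and extract_areas is just an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs: replace with the ones declared in the real file.
NSMAP = {"NS": "http://example.com/ns", "NS2": "http://example.com/ns2"}

# Trimmed copy of the question's data, with the assumed declarations added.
SAMPLE = """<NS:Member xmlns:NS="http://example.com/ns" xmlns:NS2="http://example.com/ns2">
<NS:Area fid='120410'>
<NS:Group>Building</NS:Group>
<NS:make>Manmade</NS:make>
<NS:polygon>
    <NS2:Polygon srsName='NS2:BNG'>
    <NS2:Boundary>
        <NS2:LinearRing>
            <NS2:coordinates>383415.110,400491.900 383411.090,400485.570
            </NS2:coordinates>
        </NS2:LinearRing>
    </NS2:Boundary>
    </NS2:Polygon>
</NS:polygon></NS:Area>
</NS:Member>"""


def extract_areas(xml_text):
    """Return one dict per NS:Area element with the four fields of interest."""
    root = ET.fromstring(xml_text)
    areas = []
    for area in root.iter("{%s}Area" % NSMAP["NS"]):
        areas.append({
            "id": area.get("fid"),
            "group": area.findtext("NS:Group", namespaces=NSMAP),
            "make": area.findtext("NS:make", namespaces=NSMAP),
            # .strip() drops the newline and indentation around the text.
            "coordinates": area.findtext(
                ".//NS2:coordinates", default="", namespaces=NSMAP
            ).strip(),
        })
    return areas
```

Because the whole element tree is in memory, the nested coordinates element can be reached with a single path lookup instead of tracking parser state by hand.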

To handle this in your current SAX parser, write a startElement() handler that raises a flag when it sees a <parent> tag, and an endElement() handler that clears the flag when it reaches the closing tag. While the flag is raised, startElement() records every tag it sees. The basic framework you must implement will look something like this:

from xml.sax import handler

class SaxwithParentChilds(handler.ContentHandler):

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.parentflag = False  # True while the parser is inside <parent>
        self.childlist  = []     # names of the tags seen inside <parent>

    def startElement(self, name, att):
        if name == "parent":
            self.parentflag = True
        elif self.parentflag:
            self.childlist.append(name)

    def endElement(self, name):
        if name == "parent":
            self.parentflag = False
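
As a usage sketch, the same flag idea can be run against a fragment shaped like the question's data. The names ChildCollector and child_tags are hypothetical, and the parent tag is passed in as a parameter instead of being hard-coded; this is one possible arrangement, not the only one.

```python
import xml.sax
from xml.sax import handler


class ChildCollector(handler.ContentHandler):
    """Collect the names of all tags nested inside a given parent tag."""

    def __init__(self, parent_tag):
        handler.ContentHandler.__init__(self)
        self.parent_tag = parent_tag
        self.inside = False   # True between the parent's start and end tags
        self.children = []

    def startElement(self, name, attrs):
        if name == self.parent_tag:
            self.inside = True
        elif self.inside:
            self.children.append(name)

    def endElement(self, name):
        if name == self.parent_tag:
            self.inside = False


def child_tags(xml_text, parent_tag):
    """Parse xml_text and return the tag names found inside parent_tag."""
    collector = ChildCollector(parent_tag)
    xml.sax.parseString(xml_text.encode("utf-8"), collector)
    return collector.children


FRAGMENT = (
    "<NS:polygon><NS2:Polygon><NS2:Boundary>"
    "<NS2:coordinates>1.0,2.0</NS2:coordinates>"
    "</NS2:Boundary></NS2:Polygon></NS:polygon>"
)
```

Calling child_tags(FRAGMENT, "NS:polygon") lists the nested tags in document order. With xml.sax's default settings, namespace processing is off, so prefixed names such as "NS2:Polygon" arrive as plain strings.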
HavelTheGreat
  • Hi, thanks for the answer. I've actually used my code to extract the coordinates. But for some of the blocks between a pair of and , the code crashes and cannot read the complete content. – ChangeMyName Feb 11 '15 at 18:02
  • Is there a helpful error message produced or anything? What did you change? – HavelTheGreat Feb 11 '15 at 18:16
  • @Elision The code that can extract the coordinates is posted at: http://stackoverflow.com/questions/28457426/parse-xml-file-with-namespaces-in-python-using-xml-sax. I've changed nothing in the code, but I deleted all the blocks and retained the one that crashed my code. Then I found that my code no longer crashes, but it can no longer read the coordinates either. – ChangeMyName Feb 12 '15 at 09:08
  • I tried to explicitly output `self.__coordString` in `endElement()`. It turns out that `self.__coordString` is output twice. The first time it is correct, and the second time it contains only part of the coordinates, and those partly read coordinates are not from this block. – ChangeMyName Feb 12 '15 at 09:21

All

Thanks for your help.

I just figured out what is going on: it comes down to the layout of the data file. It turns out that the closing </NS2:coordinates> tag has to follow the coordinates directly, on the same line, rather than sit on a new line.

Hope this helps other people who have the same problem.
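
A likely reason the line break matters: a SAX parser is allowed to deliver an element's text through several characters() calls, so a plain assignment keeps only the last chunk, which here would be the newline and indentation before the closing tag. Below is a minimal sketch of a handler that accumulates the chunks instead, so the file layout should not matter. CoordinateHandler and read_coordinates are hypothetical names, and the sample data is shortened.

```python
import xml.sax


class CoordinateHandler(xml.sax.ContentHandler):
    """Collect the text of every NS2:coordinates element."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_coordinates = False
        self.chunks = []        # pieces of text for the current element
        self.coordinates = []   # one cleaned string per element

    def startElement(self, tag, attributes):
        if tag == "NS2:coordinates":
            self.in_coordinates = True
            self.chunks = []

    def characters(self, content):
        # May be called several times per element, so append, never assign.
        if self.in_coordinates:
            self.chunks.append(content)

    def endElement(self, tag):
        # Compare against the tag that is closing, not the last one opened.
        if tag == "NS2:coordinates":
            self.coordinates.append("".join(self.chunks).strip())
            self.in_coordinates = False


def read_coordinates(xml_text):
    """Return the stripped text of each NS2:coordinates element."""
    coord_handler = CoordinateHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), coord_handler)
    return coord_handler.coordinates


# The closing tag sits on its own line, as in the original data file.
SAMPLE = """<NS2:LinearRing>
    <NS2:coordinates>383415.110,400491.900 383411.090,400485.570
    </NS2:coordinates>
</NS2:LinearRing>"""
```

With this approach read_coordinates(SAMPLE) yields the coordinate string without the trailing whitespace, whether or not the closing tag is on a new line.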

ChangeMyName