2

I am trying to parse an XML file and arrange it into a table that separates the contents into isElement, isAttribute, Value, and Text.

How do I use the ElementTree module to achieve this? I know this is possible using the minidom module.

The reason I want to use ElementTree is its efficiency. An example of what I am trying to achieve is available here: http://python.zirael.org/e-gtk-treeview4.html

Any advice on how to separate the XML contents into elements, subelements etc. using the ElementTree module?

This is what I have so far:

import xml.etree.cElementTree as ET

filetree = ET.ElementTree(file="some_file.xml")
for child in filetree.iter():
    print child.tag, child.text, child.attrib

For the following example xml file:

    <?xml version="1.0"?>
    <data>
        <country name="Liechtenstein">
            <rank>1</rank>
            <year>2008</year>
            <gdppc>141100</gdppc>
            <neighbor name="Austria" direction="E"/>
            <neighbor name="Switzerland" direction="W"/>
        </country>
        <country name="Singapore">
            <rank>4</rank>
            <year>2011</year>
            <gdppc>59900</gdppc>
            <neighbor name="Malaysia" direction="N"/>
        </country>
        <country name="Panama">
            <rank>68</rank>
            <year>2011</year>
            <gdppc>13600</gdppc>
            <neighbor name="Costa Rica" direction="W"/>
            <neighbor name="Colombia" direction="E"/>
        </country>
    </data>

I get this as output:

    data 
         {}
    country 
             {'name': 'Liechtenstein'}
    rank 1 {}
    year 2008 {}
    gdppc 141100 {}
    neighbor None {'direction': 'E', 'name': 'Austria'}
    neighbor None {'direction': 'W', 'name': 'Switzerland'}
    country 
             {'name': 'Singapore'}
    rank 4 {}
    year 2011 {}
    gdppc 59900 {}
    neighbor None {'direction': 'N', 'name': 'Malaysia'}
    country 
             {'name': 'Panama'}
    rank 68 {}
    year 2011 {}
    gdppc 13600 {}
    neighbor None {'direction': 'W', 'name': 'Costa Rica'}
    neighbor None {'direction': 'E', 'name': 'Colombia'}

I did find something similar in another post, "Walk through all XML nodes in an element-nested structure", but it uses the DOM module.

Based on the comment received, this is what I want to achieve:

    data (type Element)
         country(Element)
              Text = None
              name(Attribute)
                 value: Liechtenstein
              rank(Element)
                  Text = 1
              year(Element)
                  Text = 2008
              gdppc(Element)
                  Text = 141100
              neighbor(Element)
                  name(Attribute)
                      value: Austria
                  direction(Attribute)
                      value: E
              neighbor(Element)
                  name(Attribute)
                      value: Switzerland
                  direction(Attribute)
                      value: W

         country(Element)
              Text = None
              name(Attribute)
                 value: Singapore
              rank(Element)
                  Text = 4

I want to be able to present my data in a tree-like structure as above. To do this I need to keep track of the relationships between the nodes. I hope this clarifies the question.

Cœur
Saed
  • Please see the edited post for the code. – Saed Sep 02 '15 at 14:22
  • Like in the Gtk example code you'll have to write a recursive function/method that adds each node in the XML document to the `TreeStore`. There is a difference in how `ElementTree` handles text: it's not a special node type but each element has a `text` and a `tail` attribute. – BlackJack Sep 02 '15 at 14:58
  • What is wrong with the code you posted? What did you intend it to do that it doesn't? – DisappointedByUnaccountableMod Sep 02 '15 at 18:45
  • @barny In the code I posted, there is no way to track, for example, whether an element is a sub-element of the previous one. Basically, the hierarchy is not clear from the above code. – Saed Sep 03 '15 at 08:49
  • I'm a bit confused because your question said you wanted to flatten out the xml, in particular there's no mention of child/parent. Can you describe more clearly what you want to achieve? – DisappointedByUnaccountableMod Sep 03 '15 at 10:07
  • @barny Please see the modified question. I have added the final output I am after. Thanks. I think, like BlackJack said, a recursive function is what I need. Any idea if there is a built-in function in ElementTree to inform the user about the number of attributes, elements etc. available? – Saed Sep 03 '15 at 14:43
  • Have you tried searching? http://stackoverflow.com/questions/17310681/how-to-iterate-through-every-element-of-a-complicated-xml-tree and look under heading Watching Events While Parsing here https://pymotw.com/2/xml/etree/ElementTree/parse.html I searched for: python elementtree xml print nested – DisappointedByUnaccountableMod Sep 03 '15 at 14:49
  • What about reading the [documentation of the `ElementTree` module](https://docs.python.org/2/library/xml.etree.elementtree.html#element-objects)? `Element` objects are sequences containing their direct child elements, XML attributes are stored in a dictionary mapping attribute names to values. The dictionary is an attribute called `attrib` on `Element` objects. Both sequences and dictionaries support the `len()` function to find out the number of items. – BlackJack Sep 03 '15 at 14:53
  • @BlackJack Thanks.. will look into this. – Saed Sep 03 '15 at 15:27
  • @barny Thanks. I will have a read through these documentations and see how far I get. – Saed Sep 03 '15 at 15:28
  • One thing that might help you: the built-in ElementTree module has no concept of parents, but the lxml module (available on PyPI) is very similar to ElementTree (even API-compatible for the most part), except that lxml Elements do know who their parent is, so you can walk back up the tree from anywhere. – Corley Brigman Sep 03 '15 at 15:39
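
The points raised in the comments above (`len()` on elements and on `attrib`, and the missing parent link in the standard-library ElementTree) can be sketched as follows. This is only a minimal illustration: the shortened `XML` sample, the `parent_of` mapping and the `depth` helper are hypothetical names introduced here, not part of the ElementTree API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample based on the XML in the question.
XML = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
    </country>
</data>"""

root = ET.fromstring(XML)

# len() on an Element counts its direct children;
# len() on its .attrib dictionary counts its attributes.
country = root[0]
n_children = len(country)
n_attributes = len(country.attrib)

# Standard-library elements do not store their parent,
# so build a child -> parent lookup once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def depth(element):
    """Count how many ancestors an element has (the root has none)."""
    level = 0
    while element in parent_of:
        element = parent_of[element]
        level += 1
    return level


# Print each tag indented by its depth, with child and attribute counts.
for element in root.iter():
    print("    " * depth(element) + element.tag,
          len(element), "children,", len(element.attrib), "attributes")
```

The lookup is built once up front, so later parent queries are plain dictionary accesses; it has to be rebuilt if the tree is modified.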

2 Answers

1

Element objects are sequences containing their direct child elements. XML attributes are stored in a dictionary mapping attribute names to values. There are no text nodes as in DOM; text is stored in the `text` and `tail` attributes. Text inside an element but before its first subelement is stored in that element's `text`, and text between an element's end tag and the next tag is stored in that element's `tail`. So if we take the gtk-treeview4-2.py example from "TreeView IV. - display of trees", we have to rewrite this DOM code:

# ...
import xml.dom.minidom as dom
# ...

    def create_interior(self):
        # ...
        doc = dom.parse(self.filename)
        self.add_element_to_treestore(doc.childNodes[0], None)
        # ...

    def add_element_to_treestore(self, e, parent):
        if isinstance(e, dom.Element):
            me = self.model.append(parent, [e.nodeName, 'ELEMENT', ''])
            for i in range(e.attributes.length):
                a = e.attributes.item(i)
                self.model.append(me, ['@' + a.name, 'ATTRIBUTE', a.value])
            for ch in e.childNodes:
                self.add_element_to_treestore(ch, me)
        elif isinstance(e, dom.Text):
            self.model.append(
                parent, ['text()', 'TEXT_NODE', e.nodeValue.strip()])

with the following ElementTree version:

# ...
from xml.etree import ElementTree as etree
# ...

    def create_interior(self):
        # ...
        doc = etree.parse(self.filename)
        self.add_element_to_treestore(doc.getroot())
        # ...

    def add_element_to_treestore(self, element, parent=None):
        path = self.model.append(parent, [element.tag, 'ELEMENT', ''])
        # items() works on both Python 2 and Python 3
        for name, value in sorted(element.attrib.items()):
            self.model.append(path, ['@' + name, 'ATTRIBUTE', value])
        if element.text:
            self.model.append(
                path, ['text()', 'TEXT_NODE', element.text.strip()]
            )
        for child in element:
            self.add_element_to_treestore(child, path)
            # Text following a child is stored in the child's tail,
            # but it belongs to the content of the current element.
            if child.tail:
                self.model.append(
                    path, ['text()', 'TEXT_NODE', child.tail.strip()]
                )

Screenshot with your example data and the first subtree fully expanded:

[Image: screenshot of example data]


Update: Screenshot of example data and relevant import lines in code added.
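
For readers without GTK, the same recursion can be sketched as a plain-text printer that produces output close to the layout requested in the question. This is only an illustrative sketch: the `describe` function, its `level` parameter, the "Tail text" label and the shortened `XML` sample are assumptions made here, not part of the answer's code or of the ElementTree API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample based on the XML in the question.
XML = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
    </country>
</data>"""


def describe(element, level=0):
    """Return indented lines describing an element, its text,
    its attributes and, recursively, its children."""
    pad = "    " * level
    lines = [pad + "%s (Element)" % element.tag]
    # Whitespace-only text (pure indentation) is reported as None.
    text = (element.text or "").strip()
    lines.append(pad + "    Text = %s" % (text or None))
    for name, value in sorted(element.attrib.items()):
        lines.append(pad + "    %s (Attribute)" % name)
        lines.append(pad + "        value: %s" % value)
    for child in element:
        lines.extend(describe(child, level + 1))
        # Text after a child's end tag lives in child.tail.
        tail = (child.tail or "").strip()
        if tail:
            lines.append(pad + "    Tail text = %s" % tail)
    return lines


root = ET.fromstring(XML)
output = describe(root)
print("\n".join(output))
```

Returning a list of lines rather than printing directly keeps the function easy to reuse, for example to fill a table or a tree widget instead of writing to the console.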

BlackJack
  • Thanks for this. This is kind of what I was after. I am accepting this as the answer. – Saed Sep 08 '15 at 11:20
0

Possibly not exactly what you need but you can transform the XML with an XSLT to achieve a tree-like structure:

XSLT (tabs and line breaks included)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8"/>

<xsl:template match="data">

<xsl:variable name="tabonce"><xsl:text>&#10;&#x9;</xsl:text></xsl:variable>
<xsl:variable name="tabtwice"><xsl:text>&#10;&#x9;&#x9;</xsl:text></xsl:variable>

<data>
    data (type Element)<xsl:text>&#10;&#x9;</xsl:text>
    <xsl:for-each select="country">
           <xsl:value-of select="concat(local-name(.), '(Element)')"/>
           Text = <xsl:value-of select="concat('None', $tabonce)"/> 
           <xsl:value-of select="concat(name(@*), '(Attribute)')"/>
              value: <xsl:value-of select="concat(@*, $tabonce)"/>          

        <xsl:for-each select="*">
        <xsl:value-of select="concat(local-name(.), '(Element)')"/>     
              Text = <xsl:value-of select="concat(., $tabonce)"/> 

              <xsl:if test="@*">
                 <xsl:text>&#x9;</xsl:text><xsl:value-of select="concat(name(@name), '(Attribute)')"/>
                 value: <xsl:value-of select="concat(@name, $tabtwice)"/>  
                 <xsl:value-of select="concat(name(@direction), '(Attribute)')"/>
                 value: <xsl:value-of select="concat(@direction, $tabonce)"/> 
              </xsl:if>

        </xsl:for-each>
        <xsl:text>&#10;&#x9;</xsl:text>

    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
</data>    

</xsl:template>
</xsl:stylesheet>

Python script using lxml module:

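import lxml.etree as ET

# Raw strings keep the backslashes in Windows paths from being read as escapes
dom = ET.parse(r'C:\Path\To\XMLfile.xml')
xslt = ET.parse(r'C:\Path\To\XSLfile.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)

# tostring() returns bytes when an encoding such as UTF-8 is given
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
print(tree_out.decode('UTF-8'))

# The with block closes the file even if writing fails
with open(r'C:\Path\To\OutputPath.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)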

XML Output

<?xml version='1.0' encoding='UTF-8'?>
<data>
    data (type Element)
    country(Element)
        Text = None
    name(Attribute)
        value: Liechtenstein
    rank(Element)       
        Text = 1
    year(Element)       
        Text = 2008
    gdppc(Element)      
        Text = 141100
    neighbor(Element)       
        Text = 
        name(Attribute)
            value: Austria
        direction(Attribute)
            value: E
    neighbor(Element)       
        Text = 
        name(Attribute)
            value: Switzerland
        direction(Attribute)
            value: W

    country(Element)
        Text = None
    name(Attribute)
        value: Singapore
    rank(Element)       
        Text = 4
    year(Element)       
        Text = 2011
    gdppc(Element)      
        Text = 59900
    neighbor(Element)       
        Text = 
        name(Attribute)
            value: Malaysia
        direction(Attribute)
            value: N

    country(Element)
        Text = None
    name(Attribute)
        value: Panama
    rank(Element)       
        Text = 68
    year(Element)       
        Text = 2011
    gdppc(Element)      
        Text = 13600
    neighbor(Element)       
        Text = 
        name(Attribute)
            value: Costa Rica
        direction(Attribute)
            value: W
    neighbor(Element)       
        Text = 
        name(Attribute)
            value: Colombia
        direction(Attribute)
            value: E


</data>
Parfait
  • Thanks for this answer. I wasn't aware of this approach, now I know :) However, I have accepted the other answer as it is a more relevant reply to the original question. – Saed Sep 08 '15 at 11:22