I am parsing an XML file, downloaded from the internet, using lxml. Its structure is similar to this:

<root>
    <a>Some text in A node</a>
    <b><c>Some text in C node</c>Some text in B node</b>
</root>

I want to print the text inside the nodes with the following piece of code:

from lxml import etree
doc = etree.parse('some.xml')
root = doc.getroot()
for ch in root:
    print(ch.text)

Output

Some text in A node
None

This does not print the text for <b>. Why? When I change the XML as shown below, with the text first and the child node after it, I get the correct output. Is this caused by the XML syntax or by lxml? Since I cannot control the XML, because it is downloaded directly from the internet, I need a way to get the text from the original format.

<root>
    <a>Some text in A node</a>
    <b>Some text in B node<c>Some text in C node</c></b>
</root>

Output

Some text in A node
Some text in B node
falsetru
sk11

1 Answer

According to the lxml.etree._Element documentation:

The text property returns the text before the first subelement. This is either a string or the value None, if there was no text.
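To see where the missing text ends up, here is a minimal sketch. It assumes the question's XML is inlined as a string (instead of being read from 'some.xml') so that it runs on its own. In lxml, text that follows a subelement is stored in that subelement's tail attribute, not in the parent's text:

```python
from lxml import etree

# The question's XML, inlined as a string so the sketch is self-contained.
xml = b"""<root>
    <a>Some text in A node</a>
    <b><c>Some text in C node</c>Some text in B node</b>
</root>"""

root = etree.fromstring(xml)
b = root.find('b')
c = b.find('c')

b_text = b.text   # None: nothing precedes the first subelement <c>
c_text = c.text   # text inside <c>
c_tail = c.tail   # text that follows </c> inside <b>

print(b_text)
print(c_text)
print(c_tail)
```

So "Some text in B node" is not lost; it is attached to <c> as its tail.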

To print the first text node of each tag, wherever it appears, use XPath to select the child text nodes:

for ch in root:
    print(next(iter(ch.xpath('text()')), None))

or:

for text in root.xpath('*/text()'):
    print(text)
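Both approaches can be checked against the question's XML with the sketch below. It assumes the XML is inlined as a string rather than read from 'some.xml', and that the second form is meant to select the text nodes of root's children, written here as the relative path '*/text()'. The names first_texts and all_texts are illustrative only:

```python
from lxml import etree

# The question's XML, inlined as a string so the sketch is self-contained.
xml = b"""<root>
    <a>Some text in A node</a>
    <b><c>Some text in C node</c>Some text in B node</b>
</root>"""

root = etree.fromstring(xml)

# First direct text node of each child element, or None if it has none.
first_texts = [next(iter(ch.xpath('text()')), None) for ch in root]

# All direct text nodes of every child of root, in document order.
all_texts = root.xpath('*/text()')

print(first_texts)
print(all_texts)
```

The first form yields one value per child (or None), while the second returns every direct text node, so a child with text both before and after a subelement would contribute two entries.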
falsetru
  • Yes true, the documentation says it. Using `xpath()` gets the text no matter where it is placed. Thanks. – sk11 Sep 02 '14 at 09:22