I'm parsing a page that has structure like this:

<pre class="asdf">content a</pre>
<pre class="asdf">content b</pre>

# returns
content a
content b

I'm using the following XPath to get the content: "//pre[@class='asdf']/text()"

It works well unless there are elements nested inside the <pre> tag, in which case the text is not concatenated:

<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>

# returns
content
content b

If I use this XPath instead, "//pre[@class='asdf']//text()", I get the following output:

content
a
content b

I don't want either of those. I want to get all the text inside a <pre>, even if it has children. I don't care whether the tags are stripped or not, but I want the text concatenated together.

How do I do this? I'm using the xpath() method from lxml.html in Python 2, but I don't think that matters. This answer to another question makes me think that child:: might have something to do with the solution.

Here's some code that reproduces it.

from lxml import html

tree = html.fromstring("""
<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>
""")
for row in tree.xpath("//*[@class='asdf']/text()"):
    print("row: ", row)

1 Answer

.text_content() is what you should use:

.text_content(): Returns the text content of the element, including the text content of its children, with no markup.

for row in tree.xpath("//*[@class='asdf']"):
    print("row: ", row.text_content())

Demo:

>>> from lxml import html
>>> 
>>> tree = html.fromstring("""
... <pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
... <pre class="asdf">content b</pre>
... """)
>>> for row in tree.xpath("//*[@class='asdf']"):
...     print("row: ", row.text_content())
... 
('row: ', 'content a')
('row: ', 'content b')
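
If you would rather stay closer to XPath, here is a minimal sketch of two alternatives, assuming the same markup as in the question. The first evaluates the XPath `string(.)` function once per matched element, which gives the concatenated text of that element and its descendants. The second joins the text nodes with `itertext()`. The wrapping `<div>` and the variable names `rows` and `joined` are my own additions for illustration, and the script uses Python 3 `print`.

```python
from lxml import html

# The outer <div> is an assumption added here so the fragment has a single root.
tree = html.fromstring("""
<div>
<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>
</div>
""")

pres = tree.xpath("//pre[@class='asdf']")

# string(.) is evaluated relative to each <pre>, so every element
# produces exactly one string containing all of its descendant text.
rows = [pre.xpath("string(.)") for pre in pres]

# Same result without XPath functions: join every text node of the element.
joined = ["".join(pre.itertext()) for pre in pres]

for row in rows:
    print("row:", row)
```

Note that `string()` applied directly to the whole node-set, as in `string(//pre[@class='asdf'])`, only converts the first matching node, which is why the sketch selects the elements first and then calls `string(.)` on each one.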
  • You're entirely right. I was focused on the XPath. Glad I included a demo program rather than making this solely an XPath question! – tedder42 Jan 29 '16 at 05:34
  • @tedder42 Thanks for the minimal sample to demonstrate the problem. It is a pleasure to answer clear and well-explained questions like this one. – alecxe Jan 29 '16 at 05:36