I'm parsing a page that has a structure like this:
<pre class="asdf">content a</pre>
<pre class="asdf">content b</pre>
# returns
content a
content b
And I'm using the following XPath to get the content:
"//pre[@class='asdf']/text()"
It works well, except when there are elements nested inside the <pre> tag. In that case it doesn't concatenate their text:
<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>
# returns
content
content b
If I use this XPath, I get the output that follows.
"//pre[@class='asdf']//text()"
content
a
content b
I don't want either of those. I want to get all of the text inside a <pre>, even if it has children. I don't care whether the tags are stripped or not, but I want the text concatenated into one string per <pre>.
How do I do this? I'm using lxml.html.xpath in python2, but I don't think it matters. This answer to another question makes me think that maybe child:: has something to do with my answer.
Here's some code that reproduces it.
from lxml import html
tree = html.fromstring("""
<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>
""")
for row in tree.xpath("//*[@class='asdf']/text()"):
    print("row: ", row)
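For reference, here is a minimal sketch of the output I'm after. It assumes that selecting the <pre> elements first and then joining the text per element (rather than doing everything in a single XPath expression) is acceptable. It uses the `text_content()` method of lxml.html elements; the `joined_text` helper and the `rows` name are just placeholders for illustration.

```python
from lxml import html

tree = html.fromstring("""
<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>
""")

def joined_text(tree):
    # Select each <pre> element itself (not its text nodes), then let
    # text_content() join the text of the element and all its descendants.
    return [pre.text_content() for pre in tree.xpath("//pre[@class='asdf']")]

rows = joined_text(tree)  # ['content a', 'content b']
for row in rows:
    print("row: ", row)
```

This gives one string per <pre>, with the nested <a> tag stripped, which is the concatenated form described above. Whether the same result can be had from one XPath expression is still what I'd like to know.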