I am working on a web scraper (using Python), so I have a chunk of HTML from which I am trying to extract text. One of the snippets looks something like this:
<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>
I want to extract the text from this element. Now, I could use something along the lines of
//p[@class='something']//text()
but this leads to each chunk of text ending up as a separate result element, like this:
['This class has some ', 'text', ' and a few ', 'links', ' in it.']
The desired output would contain all the text in one element, like this:
This class has some text and a few links in it.
Is there an easy or elegant way to achieve this?
Edit: Here's the code that produces the result given above.
from lxml import html
html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "//p[@class='something']//text()"
tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item))
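For reference, here is a minimal sketch of three ways the desired single string might be obtained, reusing the snippet and query from above. The variable names (joined, as_string, content) are only illustrative, and this is one possible approach rather than a definitive answer: joining the text nodes in Python, wrapping the node selection in XPath's string() function, or calling text_content() on the element that lxml.html returns.

```python
from lxml import html

html_snippet = (
    '<p class="something">This class has some <strong>text</strong> '
    'and a few <a href="http://www.example.com">links</a> in it.</p>'
)
tree = html.fromstring(html_snippet)

# Option 1: keep the original query and join the separate text nodes.
joined = "".join(tree.xpath("//p[@class='something']//text()"))

# Option 2: XPath string() yields the string value of the first matching
# node, i.e. the concatenation of all its descendant text nodes.
as_string = tree.xpath("string(//p[@class='something'])")

# Option 3: select the element itself and use lxml.html's text_content().
paragraph = tree.xpath("//p[@class='something']")[0]
content = paragraph.text_content()

print(joined)
```

Note that string() only considers the first matching node, so if several paragraphs share the class, selecting the elements and calling text_content() on each one is probably the safer choice.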