I am working on a web scraper (using Python), so I have a chunk of HTML from which I am trying to extract text. One of the snippets looks something like this:

<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>

I want to extract the text from this element. I could use an XPath query along the lines of

//p[@class='something']//text()

but this leads to each chunk of text ending up as a separate result element, like this:

(This class has some ,text, and a few ,links, in it.)

The desired output would contain all the text in one element, like this:

This class has some text and a few links in it.

Is there an easy or elegant way to achieve this?

Edit: Here's the code that produces the result given above.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

xpath_query = "//p[@class='something']//text()"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item))
kjhughes
Yuka

3 Answers

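You can use normalize-space() in the XPath. It returns the string value of the matched element with leading and trailing whitespace stripped and internal runs of whitespace collapsed to single spaces. Then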

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "normalize-space(//p[@class='something'])"

tree = html.fromstring(html_snippet)
print(tree.xpath(xpath_query))

will yield

This class has some text and a few links in it.
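
One caveat worth sketching: when normalize-space() is given a node-set, XPath 1.0 converts only the first node in document order to a string, so a query that matches several paragraphs returns just one result. The example below is a minimal illustration, assuming a hypothetical snippet with two matching paragraphs (not taken from the question); it evaluates normalize-space(.) relative to each element to get one string per match.

```python
from lxml import html

# Hypothetical snippet with two matching paragraphs (not from the original question).
html_snippet = (
    '<div>'
    '<p class="something">First <strong>bold</strong> paragraph.</p>'
    '<p class="something">Second   <a href="http://www.example.com">linked</a>\n paragraph.</p>'
    '</div>'
)

tree = html.fromstring(html_snippet)

# normalize-space() on a node-set uses only the first matching node.
first_only = tree.xpath("normalize-space(//p[@class='something'])")

# Evaluating normalize-space(.) per element yields one string per match.
all_texts = [
    p.xpath("normalize-space(.)")
    for p in tree.xpath("//p[@class='something']")
]

print(first_only)
print(all_texts)
```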
kjhughes

You could call .text_content() on the lxml Element, instead of fetching the text with XPath.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

xpath_query = "//p[@class='something']"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item.text_content()))
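
Note that .text_content() returns the text as it appears in the markup, so line breaks and indentation inside the element are preserved. If that matters, one possible approach is to collapse the whitespace in Python afterwards. The sketch below assumes a hypothetical snippet with line breaks inside the paragraph; the variable names are illustrative only.

```python
from lxml import html

# Hypothetical markup with line breaks and indentation inside the paragraph.
html_snippet = (
    '<p class="something">This class has\n'
    '    some <strong>text</strong>\n'
    '    in it.</p>'
)

element = html.fromstring(html_snippet)

# text_content() concatenates all descendant text without touching whitespace.
raw = element.text_content()

# str.split() with no arguments splits on any whitespace run,
# so re-joining with a single space collapses newlines and indentation.
collapsed = " ".join(raw.split())

print(repr(raw))
print(collapsed)
```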
Robᵩ

An alternate one-liner on your original code: use a join with an empty string separator:

print("".join(query_results))
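
For completeness, here is the join approach as a self-contained sketch built on the code from the question. Keep in mind that it concatenates every text node the query returns, so if the XPath matched several paragraphs their text would all end up in a single string.

```python
from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

tree = html.fromstring(html_snippet)

# Each text node comes back as a separate list item.
query_results = tree.xpath("//p[@class='something']//text()")

# Joining with an empty separator keeps the original spacing between chunks.
joined = "".join(query_results)

print(joined)
```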
bjimba