How to select text without the HTML markup

Question

I am working on a web scraper (using Python), so I have a chunk of HTML from which I am trying to extract text. One of the snippets looks something like this:

<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>

I want to extract the text from this class. Now, I could use something along the lines of

//p[@class='something']//text()

but this leads to each chunk of text ending up as a separate result element, like this:

(This class has some ,text, and a few ,links, in it.)

The desired output would contain all the text in one element, like this:

This class has some text and a few links in it.

Is there an easy or elegant way to achieve this?

Edit: Here's the code that produces the result given above.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

xpath_query = "//p[@class='something']//text()"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item))

What HTML parsing library are you using? – alecxe, Apr 01 '15 at 19:03
I'm using lxml, I've updated the question. – Yuka, Apr 01 '15 at 19:10

score 3 · Answer 1 · answered Apr 01 '15 at 19:49

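You can wrap the path in normalize-space() inside the XPath expression, which returns a single string with leading and trailing whitespace stripped and internal runs of whitespace collapsed. Then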

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "normalize-space(//p[@class='something'])"

tree = html.fromstring(html_snippet)
print(tree.xpath(xpath_query))

will yield

This class has some text and a few links in it.
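
One caveat: in XPath 1.0, a string function applied to a node-set only uses the first node in document order, so the query above would return the text of the first matching paragraph only. A minimal sketch, assuming a page with several matching paragraphs (the two-paragraph snippet and the names `first_only` and `texts` are made up for illustration), evaluates normalize-space(.) relative to each element instead:

```python
from lxml import html

# Hypothetical snippet with two matching paragraphs, for illustration only.
html_snippet = (
    '<div>'
    '<p class="something">First <strong>bold</strong> one.</p>'
    '<p class="something">Second\n   <a href="http://www.example.com">link</a> one.</p>'
    '</div>'
)

tree = html.fromstring(html_snippet)

# Applied to the whole node-set, normalize-space() only sees the first match.
first_only = tree.xpath("normalize-space(//p[@class='something'])")

# Evaluating it relative to each element gives one string per paragraph.
texts = [p.xpath("normalize-space(.)") for p in tree.xpath("//p[@class='something']")]

print(first_only)
print(texts)
```

The second form also collapses the newline and indentation inside the second paragraph, which is the main difference from simply concatenating the text nodes.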

score 1 · Accepted Answer · answered Apr 01 '15 at 19:49

You could call .text_content() on the lxml Element, instead of fetching the text with XPath.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

xpath_query = "//p[@class='something']"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item.text_content()))
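
Note that .text_content() returns the text exactly as it appears in the source, including any newlines or indentation inside the element. If the real pages contain that kind of whitespace, a small sketch like the following may help; the snippet with the embedded newline and the names `raw` and `collapsed` are assumptions for illustration, not part of the original question:

```python
from lxml import html

# Hypothetical snippet with a line break and indentation inside the paragraph.
html_snippet = '<p class="something">This class has\n    some <strong>text</strong> in it.</p>'

tree = html.fromstring(html_snippet)
paragraph = tree.xpath("//p[@class='something']")[0]

# text_content() concatenates all descendant text, whitespace included.
raw = paragraph.text_content()

# str.split() with no arguments splits on any whitespace run,
# so joining with a single space collapses newlines and indentation.
collapsed = " ".join(raw.split())

print(repr(raw))
print(collapsed)
```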

score 0 · Answer 3 · answered Apr 01 '15 at 19:50

An alternate one-liner on your original code: use a join with an empty string separator:

print("".join(query_results))
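
For completeness, here is that one-liner as a self-contained sketch built on the code from the question (the variable name `joined` is added here for illustration):

```python
from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

tree = html.fromstring(html_snippet)

# //text() returns every text node under the paragraph as a separate string.
query_results = tree.xpath("//p[@class='something']//text()")

# Joining with an empty separator keeps the original spacing between chunks.
joined = "".join(query_results)
print(joined)
```

Keep in mind that if the query matches more than one paragraph, the join merges the text of all of them into a single string.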

answered by bjimba