
I want to parse an HTML document like this with requests-html 0.9.0:

from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('.data', first=True)
print(data.html)
# <span class="data">important data</span> and some rubbish
print(data.text)
# important data and some rubbish

I need to distinguish the text inside the tag (enclosed by it) from the tag's tail (the text that follows the element up to the next tag). This is the behaviour I initially expected:

data.text == 'important data'
data.tail == ' and some rubbish'

But tail is not defined for Elements. Since requests-html provides access to inner lxml objects, we can try to get it from lxml.etree.Element.tail:

from lxml.etree import tostring
print(tostring(data.lxml))
# b'<html><span class="data">important data</span></html>'
print(data.lxml.tail is None)
# True

There's no tail in the lxml representation! The tag with its inner text is intact, but the tail seems to have been stripped away. How do I extract ' and some rubbish'?

Edit: I discovered that full_text provides the inner text only (so much for “full”). This enables a dirty hack of subtracting full_text from text, although I'm not positive it will work if there are any links.

print(data.full_text)
# important data
Martijn Pieters
Norrius

2 Answers

3

I'm not sure I've understood your problem, but if you just want to get ' and some rubbish' you can use the code below:

from requests_html import HTML
from lxml.html import fromstring

html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = fromstring(html.html)
# or without using requests_html.HTML: data = fromstring('<span><span class="data">important data</span> and some rubbish</span>')
print(data.xpath('//span[span[@class="data"]]/text()')[-1])  # " and some rubbish"

Note that data = html.find('.data', first=True) returns the <span class="data">important data</span> node, which doesn't contain " and some rubbish": that text is a child text node of the parent span!
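
To illustrate the note above, here is a minimal sketch of how the parent's direct text nodes (what the XPath text() step selects) are stored in the ElementTree model that lxml follows: as the parent's own text plus the tail of each child. It uses the standard library's xml.etree.ElementTree instead of lxml, which is an assumption made only to keep the example dependency-free (the sample markup happens to be well-formed XML). The helper name direct_text_nodes is hypothetical.

```python
import xml.etree.ElementTree as ET


def direct_text_nodes(parent):
    """Collect the text nodes that are direct children of `parent`.

    Hypothetical helper: the parent's leading text comes first,
    followed by the tail of every child element, in document order.
    """
    nodes = []
    if parent.text:
        nodes.append(parent.text)
    for child in parent:
        if child.tail:
            nodes.append(child.tail)
    return nodes


root = ET.fromstring(
    '<span><span class="data">important data</span> and some rubbish</span>'
)
print(direct_text_nodes(root))  # [' and some rubbish']
```

The inner span's own text ('important data') is not in the result, because it belongs to the child element rather than to the parent.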

Andersson
  • This does work, although I must admit it's not the *HTML Parsing for Humans* I hoped for. As to the note: I think that's the same as the first code block in my post? It *does* contain "and some rubbish", unless I missed something. – Norrius Apr 21 '18 at 12:22
  • I'm quite sure that `print(data.text)` containing `"and some rubbish"` is a bug in requests-html. And *HTML Parsing for Humans* is just a cheap attempt to advertise a product :) HTML itself is not all that clear to humans, so it's naive to hope for a tool that makes handling the HTML DOM simple – Andersson Apr 21 '18 at 12:42
  • I've been eyeing requests_html since it was released, hoping for a more pythonic alternative to lxml in parsing *simple* HTML. (I don't need it to do any heavy duty work!) I think I'll file an issue on GitHub regarding this. – Norrius Apr 21 '18 at 12:53
  • [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) might be used as an alternative to lxml; requests-html is more like an alternative to headless Selenium, Scrapy, PyQt... – Andersson Apr 21 '18 at 12:59
0

The tail property exists on objects of type 'lxml.html.HtmlElement'.

I think what you are asking for is very easy to implement.

Here is a very simple example using requests_html and lxml:

from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('span')
print(data[0].text)  # important data and some rubbish
print(data[-1].text)  # important data
print(data[-1].element.tail)  #  and some rubbish

The element attribute points to the 'lxml.html.HtmlElement' object.
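
For reference, the text/tail split itself can be reproduced without requests_html. The following is only a sketch: it assumes the standard library's xml.etree.ElementTree, which exposes the same text and tail attributes as lxml elements, and the helper name split_text_and_tail is hypothetical.

```python
import xml.etree.ElementTree as ET


def split_text_and_tail(element):
    """Return (text, tail) for `element`, replacing None with ''.

    Note: `text` is only the text before the element's first child,
    which is the whole inner text when the element has no children.
    """
    return (element.text or "", element.tail or "")


root = ET.fromstring(
    '<span><span class="data">important data</span> and some rubbish</span>'
)
# Select the direct child <span> whose class attribute is "data".
inner = root.find("span[@class='data']")

text, tail = split_text_and_tail(inner)
print(repr(text))  # 'important data'
print(repr(tail))  # ' and some rubbish'
```

The tail keeps its leading space, so strip it if you only need the words.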

Hope this helps.

ab92