How to get text from HTML element by using lxml.html

Question

I've been trying to get the full text contained in a <div> element on the web page https://www.list-org.com/company/11665809.
The element should contain the substring "Арбитраж".
And it does, because my code

for div in tree.xpath('.//div[contains(text(), "Арбитраж")]'):
    print(div)

prints

<Element div at 0x15480d93ac8>

But when I try to get the full text itself with div.text, it returns None,
which seems like a strange result to me. What should I do?
Any help would be greatly appreciated, as would advice on a source for learning the basics of HTML (I'm not a savvy programmer) so that I can avoid such easy questions in the future.

score 1 · Accepted Answer · answered May 10 '20 at 10:50

This is one of those strange things that happen when XPath is handled by a host language and library. When you use the XPath expression

 .//div[contains(text(), "Арбитраж")]

the search follows XPath rules, under which text() selects every text node that is a direct child of the div, so the target text counts as contained within the target div. When you go on to the next line:

print(div.text)

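you are using the lxml element API, where div.text holds only the text that comes before the element's first child. Here the target text is preceded by an <i> tag, so lxml stores it as the tail of that <i> element rather than as the div's text, and div.text is None. To get to it with lxml.html, you have to use: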

print(div.text_content())

or with xpath only:

print(tree.xpath('.//div[contains(text(), "Арбитраж")]/text()')[0])

It seems lxml.etree and beautifulsoup use different approaches. See this interesting discussion here.
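
To make the difference concrete, here is a minimal sketch. The markup in `snippet` is hypothetical (it is not copied from the real page) and only imitates the situation described above: an <i> tag first, then text, then a nested element.

```python
from lxml import html

# Hypothetical markup, NOT taken from the real page: an <i> icon,
# then some text, then a nested <span>, then more text.
snippet = '<div><i class="icon"></i>Арбитраж <span>(1 шт.)</span>:</div>'
div = html.fragment_fromstring(snippet)

# .text covers only the text before the first child; there is none here.
div_text = div.text
# The text that follows <i> is stored as the tail of the <i> element.
i_tail = div[0].tail
# text() selects only the div's direct text nodes and skips the <span>.
direct_text_nodes = div.xpath('text()')
# text_content() joins the text of the element and all its descendants.
full_text = div.text_content()

print(div_text)
print(repr(i_tail))
print(direct_text_nodes)
print(full_text)
```

With markup like this, the first item returned by text() is only a fragment of the visible text, so text_content() is likely the better fit when the div has nested elements.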

**Thanks a lot, @Jack Fleeting.** In this case `print(div.text_content())` raises `AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'`, but `print(tree.xpath('.//div[contains(text(), "Арбитраж")]/text()')[0])` works. I can't say it works the way I want, because it gives only `Арбитраж: `, which is not the full text of the element, `Арбитраж (1 шт.):`. Anyway, now I know the reason. - Sergey Solod, May 10 '20 at 11:53
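
The AttributeError in this comment suggests the tree was built with lxml.etree rather than lxml.html; text_content() is defined only on lxml.html elements. A minimal sketch of two alternatives that also work on plain etree elements, again using hypothetical markup rather than the real page:

```python
from lxml import etree

# Hypothetical markup, NOT taken from the real page.
snippet = '<div><i class="icon"></i>Арбитраж <span>(1 шт.)</span>:</div>'

# Parsing with lxml.etree and an HTML parser yields plain etree elements.
root = etree.fromstring(snippet, etree.HTMLParser())
div = root.xpath('.//div[contains(text(), "Арбитраж")]')[0]

# Plain etree elements do not provide text_content().
has_text_content = hasattr(div, 'text_content')

# Option 1: join every text fragment inside the element, in document order.
full_text_iter = ''.join(div.itertext())
# Option 2: let XPath compute the string value of the element.
full_text_xpath = div.xpath('string(.)')

print(has_text_content)
print(full_text_iter)
print(full_text_xpath)
```

Both options collect text from nested elements too, so they should return the whole visible string instead of only the first text node.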
