
I've been trying to get the full text of a <div> element from the web page https://www.list-org.com/company/11665809.
The element should contain the substring "Арбитраж".
And it does, because my code

for div in tree.xpath('.//div[contains(text(), "Арбитраж")]'):
    print(div)

prints

<Element div at 0x15480d93ac8>


But when I try to get the full text itself with div.text, it returns None.
That seems like a strange result to me. What should I do?
Any help would be greatly appreciated, as would advice on a good source for learning the basics of HTML (I'm not an experienced programmer), so I can avoid such easy questions in the future.

Sergey Solod

1 Answer


This is one of those quirks that come up when XPath is evaluated through a host language and library. When you use the XPath expression

 .//div[contains(text(), "Арбитраж")] 

the search is performed according to XPath rules, which treat the target text as a text node contained in the target div. When you go on to the next line:

print(div.text)

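you are using the lxml element API, where .text holds only the text that comes before the element's first child. Here the target text is preceded by the <i> tag, so lxml stores it as the tail of that <i> element rather than as the div's text, and div.text is None. To get to it, with lxml.html, you have to use: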

print(div.text_content())

or with xpath only:

print(tree.xpath('.//div[contains(text(), "Арбитраж")]/text()')[0])

It seems lxml.etree and BeautifulSoup use different approaches. See this interesting discussion here.
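To make the difference concrete, here is a minimal sketch. The markup in snippet is made up for illustration (I have not copied it from the real page), so treat the structure as an assumption: an empty <i> followed by text, a link, and a trailing colon. It shows where lxml stores each piece of text and two ways to collect all of it from an lxml.etree element. Note that text_content() is defined on lxml.html elements only, which is why the last lines parse the same snippet with lxml.html.

```python
from lxml import etree
import lxml.html

# Hypothetical markup, assumed for illustration; the real page may differ.
snippet = '<div><i class="icon"></i>Арбитраж <a href="#">(1 шт.)</a>:</div>'

# Parse with lxml.etree's HTML parser (gives plain lxml.etree elements).
tree = etree.HTML(snippet)
div = tree.xpath('.//div[contains(text(), "Арбитраж")]')[0]

# .text is only the text before the first child; here there is none.
print(div.text)

# The text that follows <i> is stored as the tail of <i>.
print(div[0].tail)

# text() yields the div's direct text nodes one by one; [0] is just the first.
first_node = div.xpath('text()')[0]
print(first_node)

# Two ways to get all text inside the div, descendants included.
full_text = ''.join(div.itertext())
xpath_text = div.xpath('string(.)')
print(full_text)
print(xpath_text)

# text_content() exists on lxml.html elements, not on plain etree ones.
html_div = lxml.html.fromstring(snippet)
html_text = html_div.text_content()
print(html_text)
```

If the tree was built with lxml.etree, itertext() or string(.) are probably the simplest choices, since neither requires re-parsing the page with lxml.html.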

Jack Fleeting
**Thanks a lot, @Jack Fleeting.** In this case `print(div.text_content())` raises `AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'`, but `print(tree.xpath('.//div[contains(text(), "Арбитраж")]/text()')[0])` works. I can't say it does exactly what I want, because it returns only `Арбитраж: `, which is not the full text of the element, `Арбитраж (1 шт.):`. Anyway, now I know the reason. – Sergey Solod May 10 '20 at 11:53