How to use lxml to find an element by text?

Question

Assume we have the following html:

<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
    </body>
</html>

How do I make it find the element "a", which contains "TEXT A"?

So far I've got:

root = lxml.html.document_fromstring(the_html_above)
e = root.find('.//a')

I've tried:

e = root.find('.//a[@text="TEXT A"]')

but that didn't work, as the "a" tags have no attribute "text".

Is there any way I can solve this in a similar fashion to what I've tried?

Have you tried `:contains`? – Snakes and Coffee Jan 13 '13 at 02:14
Refer to unutbu's answer. – Snakes and Coffee Jan 13 '13 at 02:18

unutbu · Accepted Answer · 2013-01-13T02:21:08.200

60

You are very close. Use text()= rather than @text (which indicates an attribute).

e = root.xpath('.//a[text()="TEXT A"]')

Or, if you know only that the text contains "TEXT A",

e = root.xpath('.//a[contains(text(),"TEXT A")]')

Or, if you know only that text starts with "TEXT A",

e = root.xpath('.//a[starts-with(text(),"TEXT A")]')

See the XPath docs for more on the available string functions.

For example,

import lxml.html as LH

text = '''\
<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
    </body>
</html>'''

root = LH.fromstring(text)
e = root.xpath('.//a[text()="TEXT A"]')
print(e)

yields

[<Element a at 0xb746d2cc>]
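
As a rough sketch of how the three predicates above differ, the snippet below runs each of them against a slightly modified copy of the sample markup. The padded second link and the `hrefs` helper are assumptions added for illustration and are not part of the original question. `normalize-space()` is a standard XPath 1.0 function that can help when the text is surrounded by whitespace.

```python
import lxml.html as LH

# Sample markup; the padded second link is a hypothetical addition
# used to show how leading whitespace affects each predicate.
html = '''\
<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">  TEXT A (padded)  </a>
        <a href="/7445.html">TEXT C</a>
    </body>
</html>'''

root = LH.fromstring(html)


def hrefs(expr):
    """Return the href of every element matched by an XPath expression."""
    return [a.get('href') for a in root.xpath(expr)]


# Exact match on a text node.
exact = hrefs('.//a[text()="TEXT A"]')
# Substring match on the first text node.
partial = hrefs('.//a[contains(text(), "TEXT A")]')
# Prefix match; leading whitespace makes the padded link fail here.
prefix = hrefs('.//a[starts-with(text(), "TEXT A")]')
# Prefix match after trimming and collapsing whitespace.
trimmed = hrefs('.//a[starts-with(normalize-space(), "TEXT A")]')

print(exact)    # ['/1234.html']
print(partial)  # ['/1234.html', '/3243.html']
print(prefix)   # ['/1234.html']
print(trimmed)  # ['/1234.html', '/3243.html']
```

Note that `xpath` always returns a list, so index into it (for example `exact[0]`) when a single element is expected.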

edited Jan 13 '13 at 02:21

answered Jan 13 '13 at 02:14

unutbu

That gives me SyntaxError: invalid predicate. – user1973386 Jan 13 '13 at 02:19
Right. `find`/`findall` are simplified methods that support only a limited subset of XPath. With the current version of lxml, `xpath` accepts XPath version 1.0. – unutbu Jan 13 '13 at 02:39
Oops, just deleted the comment before you posted. I replaced find and findAll in my code and it works. Thank you once more :) – user1973386 Jan 13 '13 at 02:46
I get `[]`, but when I try to fetch the text I get this error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 82: invalid start byte`. Please help. – Dev_Man Dec 02 '17 at 15:04
Thanks. I was using `root.find(`… instead of `root.xpath(`…. – Geremia Jul 07 '20 at 21:34
The xpath docs linked above are long and difficult to read. There is a really good introductory presentation [here](https://courses.ischool.berkeley.edu/i290-14/s05/lecture-4/allslides.html) with everything you need to get going. – John Nov 10 '20 at 23:04
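
The `find` versus `xpath` difference discussed in these comments can be checked with a small sketch like the one below. It assumes an lxml version in which `find` rejects the `text()=` predicate, as reported above. The variable names are chosen for illustration only.

```python
import lxml.html as LH

root = LH.fromstring(
    '<html><body><a href="/1234.html">TEXT A</a></body></html>'
)

query = './/a[text()="TEXT A"]'

# find() understands only the simplified ElementPath syntax, so this
# predicate is expected to be rejected (assumption: behaviour as
# reported in the comments above).
try:
    root.find(query)
    find_error = None
except SyntaxError as exc:
    find_error = str(exc)

# xpath() evaluates the full XPath 1.0 expression and returns a list.
matches = root.xpath(query)
match_hrefs = [a.get('href') for a in matches]

print(find_error)
print(match_hrefs)
```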

score 9 · Answer 2 · answered Jul 20 '13 at 17:21

Another way that looks more straightforward to me:

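import lxml.html

results = []
root = lxml.html.fromstring(the_html_above)
for tag in root.iter():
    # tag.text is None for elements without leading text, so guard first
    if tag.text and "TEXT A" in tag.text:
        results.append(tag)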

answered Jul 20 '13 at 17:21

ToonAlfrink

Or of course `tag.text == "TEXT A"` if you're looking for the exact text – ToonAlfrink Jul 20 '13 at 17:22
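
A possible variant of this iteration approach, sketched under the assumption that only `a` elements are of interest: `iter("a")` limits the loop to link elements, and `text_content()` (available on `lxml.html` elements) also covers text inside nested tags. The names `exact` and `loose` are illustrative.

```python
import lxml.html

the_html_above = '''\
<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
    </body>
</html>'''

root = lxml.html.fromstring(the_html_above)

# iter("a") visits only <a> elements, so no None check is needed
# for the equality comparison.
exact = [tag for tag in root.iter("a") if tag.text == "TEXT A"]

# text_content() joins the text of the element and its descendants.
loose = [tag for tag in root.iter("a") if "TEXT" in tag.text_content()]

print([tag.get("href") for tag in exact])  # ['/1234.html']
print(len(loose))                          # 3
```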
