38

Assume we have the following html:

<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
    <body>
</html>

How do I make it find the element "a", which contains "TEXT A"?

So far I've got:

root = lxml.html.document_fromstring(the_html_above)
e = root.find('.//a')

I've tried:

e = root.find('.//a[@text="TEXT A"]')

but that didn't work, as the "a" tags have no attribute "text".

Is there any way I can solve this in a similar fashion to what I've tried?

Raymond Reddington
  • 1,709
  • 1
  • 13
  • 21
user1973386
  • 1,095
  • 2
  • 10
  • 18

2 Answers2

60

You are very close. Use text()= rather than @text (which indicates an attribute).

e = root.xpath('.//a[text()="TEXT A"]')

Or, if you know only that the text contains "TEXT A",

e = root.xpath('.//a[contains(text(),"TEXT A")]')

Or, if you know only that text starts with "TEXT A",

e = root.xpath('.//a[starts-with(text(),"TEXT A")]')

See the docs for more on the available string functions.


For example,

import lxml.html as LH

text = '''\
<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
    <body>
</html>'''

root = LH.fromstring(text)
e = root.xpath('.//a[text()="TEXT A"]')
print(e)

yields

[<Element a at 0xb746d2cc>]
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • 2
    That gives me SyntaxError: invalid predicate. – user1973386 Jan 13 '13 at 02:19
  • 3
    Right. `find`/`findAll` are simplified methods which do not allow all kinds of XPath. With the current version of lxml, `xpath` accepts XPath version 1.0. – unutbu Jan 13 '13 at 02:39
  • Oops, just deleted the comment before you posted. I replaced find and findAll in my code and it works. Thank you once more :) – user1973386 Jan 13 '13 at 02:46
  • I get [] but when i try to fetch the text i get this error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 82: invalid start byte pls help – Dev_Man Dec 02 '17 at 15:04
  • Thanks. I was using `root.find(`… instead of `root.xpath(`…. – Geremia Jul 07 '20 at 21:34
  • The xpath docs linked above are long and difficult to read. There is a really good introductory presentation [here](https://courses.ischool.berkeley.edu/i290-14/s05/lecture-4/allslides.html) with everything you need to get going. – John Nov 10 '20 at 23:04
9

Another way that looks more straightforward to me:

results = []
root = lxml.hmtl.fromstring(the_html_above)
for tag in root.iter():
    if "TEXT A" in tag.text
        results.append(tag)
ToonAlfrink
  • 2,501
  • 2
  • 19
  • 19