Finding html element with class using lxml

Question

I've searched everywhere, and what I mostly found was doc.xpath('//element[@class="classname"]'), but this does not work no matter what I try.

Code I'm using:

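import lxml.html
from urllib.request import urlopen

def check():
    # 'url' is a placeholder for the real page address
    data = urlopen('url').read()
    return data.decode('utf-8', 'ignore')

doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='test']")
print(el)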

It simply prints an empty list.

Edit: How odd. I used Google as a test page and it works fine there, but it doesn't work on the page I was using (YouTube).

Here's the exact code I'm using.

import lxml.html
from urllib.request import urlopen

def check():
    # TopGear as a test
    data = urlopen('http://www.youtube.com/user/TopGear').read()
    return data.decode('utf-8', 'ignore')


doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='channel']")
print(el)

`'url'` is a 3-character string. It is not a HTML file. – mzjn Nov 22 '11 at 22:43
Obviously I did that instead of posting the real url. – Vexx Nov 23 '11 at 16:46
Please provide a [SSCCE](http://sscce.org/). – mzjn Nov 23 '11 at 18:46

mzjn · Answer 1 · 2011-11-24T21:12:55.050

35

The TopGear page that you use for testing doesn't have any <div class="channel"> elements. But this works (for example):

el = doc.xpath("//div[@class='channel-title-container']")

Or this:

el = doc.xpath("//div[@class='a yb xr']")

To find <div> elements with a class attribute that contains the string channel, you could use

el = doc.xpath("//div[contains(@class, 'channel')]")
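As a rough illustration of the difference between the exact comparison and `contains()`, here is a small self-contained sketch. The markup is made up for the example (it is not the real YouTube page), and the variable names are only illustrative.

```python
import lxml.html

# Hypothetical markup standing in for the real page.
html = """
<html><body>
  <div class="channel">exact</div>
  <div class="branded-page channel">two classes</div>
  <div class="channel-title-container">longer name</div>
</body></html>
"""
doc = lxml.html.document_fromstring(html)

# @class='channel' compares the whole attribute value as one string.
exact = doc.xpath("//div[@class='channel']")

# contains() does a plain substring test on the attribute value.
substring = doc.xpath("//div[contains(@class, 'channel')]")

exact_text = [d.text for d in exact]
substring_text = [d.text for d in substring]
print(exact_text)      # ['exact']
print(substring_text)  # ['exact', 'two classes', 'longer name']
```

Note that the substring test also picks up `channel-title-container`, which may or may not be what you want.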

edited Nov 24 '11 at 21:12

answered Nov 24 '11 at 17:16

mzjn

`branded-page channel` is not the same as `channel`. – mzjn Nov 24 '11 at 21:11
But, according to css, that element has two classes, branded-page and channel. So why wouldn't it? – Vexx Nov 24 '11 at 22:33
Yes, according to CSS there are two classes. But XPath does not know about the rules of CSS. To XPath, `branded-page channel` is just a string with no special meaning. – mzjn Nov 24 '11 at 23:03
That's actually helpful, thanks. Just as a test, I tried to get an element on this page, and it's not working either. This is really starting to piss me off. el = doc.xpath('//a[@class="vote-accepted-off"]') It appears that it doesn't like to find elements that don't have child elements. – Vexx Nov 25 '11 at 00:22
Just to complete your answer, we can also use **not()** for negation. Example: **`el = doc.xpath("//div[contains(@class, 'channel') and not(contains(@class, 'disabled'))]")`** – Efe Mar 03 '20 at 02:18
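A minimal sketch of the `not()` variant from the last comment, run against made-up markup (the class names `disabled` and `sidebar` are assumptions for the example only):

```python
import lxml.html

# Hypothetical markup; only the class attributes matter here.
html = """
<html><body>
  <div class="channel">active</div>
  <div class="channel disabled">off</div>
  <div class="sidebar">other</div>
</body></html>
"""
doc = lxml.html.document_fromstring(html)

# Keep divs whose class mentions 'channel' but not 'disabled'.
enabled = doc.xpath(
    "//div[contains(@class, 'channel') and not(contains(@class, 'disabled'))]"
)
enabled_text = [d.text for d in enabled]
print(enabled_text)  # ['active']
```

Both tests are still substring tests, so the same caveat about partial class names applies.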

score 2 · Answer 2 · answered Jan 26 '12 at 02:56

You can use lxml.cssselect to simplify class and id lookups: http://lxml.de/dev/cssselect.html
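A short sketch of what that can look like, on made-up markup. In recent lxml versions the CSS support depends on the separate `cssselect` package, so the sketch guards that call. It also shows `find_class()`, a method that `lxml.html` elements provide without the extra package. That second part is an alternative added here, not something this answer proposed.

```python
import lxml.html

# Hypothetical markup for the example.
html = """
<html><body>
  <div class="branded-page channel">a</div>
  <div class="somechannel">b</div>
</body></html>
"""
doc = lxml.html.document_fromstring(html)

# find_class() matches whole class names and needs no extra package.
by_class = [d.text for d in doc.find_class("channel")]

# cssselect() accepts CSS selectors, but raises ImportError
# if the external cssselect package is not installed.
try:
    by_css = [d.text for d in doc.cssselect("div.channel")]
except ImportError:
    by_css = None

print(by_class)  # ['a']
print(by_css)
```

Both approaches treat the class attribute as a space-separated list, so `branded-page channel` matches while `somechannel` does not.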

dmzkrsk

score 1 · Answer 3 · answered Apr 28 '19 at 14:40

HTML uses classes (a lot), which makes them convenient hooks for XPath queries. However, XPath has no knowledge of or support for CSS classes (or even space-separated lists), which makes classes a pain in the ass to check. The canonically correct way to look for elements having a specific class is:

//*[contains(concat(' ', normalize-space(@class), ' '), ' $className ')]

In your case this is

el = doc.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' channel ')]"
)
# print(el)
# [<Element div at 0x7fa44e31ccc8>, <Element div at 0x7fa44e31c278>, <Element div at 0x7fa44e31cdb8>]

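Or register your own XPath function, hasaclass(*classes):

from lxml import etree

def _hasaclass(context, *cls):
    # One possible implementation: true if the node being tested
    # carries at least one of the given class names.
    classes = context.context_node.get('class', '').split()
    return any(c in classes for c in cls)

# None registers the function without a namespace prefix
xpath_utils = etree.FunctionNamespace(None)
xpath_utils['hasaclass'] = _hasaclass

el = doc.xpath("//div[hasaclass('channel')]")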

The second arg in `contains()` needs the surrounding spaces too (like `' $className '` and `' channel '`). Without them you'll still match classes like `somechannel`. – Daniel Haley Apr 29 '19 at 17:43
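To make the point about the spaces concrete, here is a small sketch on made-up markup comparing the padded and unpadded second argument. The helper `find_by_class` is hypothetical (not part of lxml). It passes the class name as an XPath variable, which lxml fills in from the keyword argument.

```python
import lxml.html

# Hypothetical markup: one real match and one near miss.
html = """
<html><body>
  <div class="branded-page channel">wanted</div>
  <div class="somechannel">false positive</div>
</body></html>
"""
doc = lxml.html.document_fromstring(html)

padded_class = "concat(' ', normalize-space(@class), ' ')"

# Without spaces around the name this is still a substring test.
loose = doc.xpath("//div[contains(%s, 'channel')]" % padded_class)

# With spaces on both sides only whole class names match.
strict = doc.xpath("//div[contains(%s, ' channel ')]" % padded_class)


def find_by_class(root, name):
    """Hypothetical helper: elements whose class list includes `name`."""
    expr = "//*[contains(%s, concat(' ', $cls, ' '))]" % padded_class
    return root.xpath(expr, cls=name)


loose_text = [d.text for d in loose]
strict_text = [d.text for d in strict]
helper_text = [d.text for d in find_by_class(doc, "channel")]
print(loose_text)   # ['wanted', 'false positive']
print(strict_text)  # ['wanted']
print(helper_text)  # ['wanted']
```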
