16

I've searched everywhere, and what I mostly found was doc.xpath('//element[@class="classname"]'), but this does not work no matter what I try.

The code I'm using:

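import lxml.html
from urllib.request import urlopen

def check():
    # 'url' is a placeholder for the address of the page being scraped
    data = urlopen('url').read()
    # decode the bytes; str(data) would give the "b'...'" repr instead of the markup
    return data.decode('utf-8', 'ignore')

doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='test']")
print(el)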

It simply prints an empty list.

Edit: How odd. I used Google as a test page and it works fine there, but it doesn't work on the page I was actually using (YouTube).

Here's the exact code I'm using.

import lxml.html
from urllib.request import urlopen
import sys

def check():
    data = urlopen('http://www.youtube.com/user/TopGear').read()  # TopGear as a test
    return data.decode('utf-8', 'ignore')


doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='channel']")
print(el)
Vexx

3 Answers

35

The TopGear page that you use for testing doesn't have any <div class="channel"> elements. But this works (for example):

el = doc.xpath("//div[@class='channel-title-container']")

Or this:

el = doc.xpath("//div[@class='a yb xr']")

To find <div> elements with a class attribute that contains the substring channel (this also matches names such as channel-title-container), you could use

el = doc.xpath("//div[contains(@class, 'channel')]") 
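
To see the difference between the exact comparison and contains() without hitting the network, here is a minimal sketch. The markup in SAMPLE is made up for illustration and is not the real YouTube page.

```python
import lxml.html

# Hypothetical sample markup, not taken from the real page.
SAMPLE = """
<html><body>
  <div class="channel">exact</div>
  <div class="branded-page channel">two classes</div>
  <div class="channel-title-container">prefix only</div>
</body></html>
"""

doc = lxml.html.document_fromstring(SAMPLE)

# @class='channel' compares the whole attribute value as one string.
exact = doc.xpath("//div[@class='channel']")

# contains() is a plain substring test on the attribute value.
substring = doc.xpath("//div[contains(@class, 'channel')]")

exact_text = [d.text for d in exact]
substring_text = [d.text for d in substring]

print(exact_text)      # ['exact']
print(substring_text)  # ['exact', 'two classes', 'prefix only']
```

So the exact comparison misses the element with two classes, while contains() also picks up any class name that merely includes the text.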
mzjn
  • `branded-page channel` is not the same as `channel`. – mzjn Nov 24 '11 at 21:11
  • But according to CSS, that element has two classes, branded-page and channel. So why wouldn't it match? – Vexx Nov 24 '11 at 22:33
  • Yes, according to CSS there are two classes. But XPath does not know about the rules of CSS. To XPath, `branded-page channel` is just a string with no special meaning. – mzjn Nov 24 '11 at 23:03
  • That's actually helpful, thanks. Just as a test, I tried to get an element on this page with `el = doc.xpath('//a[@class="vote-accepted-off"]')`, and it's not working either. This is really starting to piss me off. It appears that it doesn't like to find elements that don't have child elements. – Vexx Nov 25 '11 at 00:22
  • Just to complete your answer, we can also use **not()** for negation. Example: **`el = doc.xpath("//div[contains(@class, 'channel') and not(contains(@class, 'disabled'))]")`** – Efe Mar 03 '20 at 02:18
2

You can use lxml.cssselect to simplify class and id queries: http://lxml.de/dev/cssselect.html

dmzkrsk
1

HTML uses classes a lot, which makes them convenient hooks for XPath queries. However, XPath has no knowledge of CSS classes (or even of space-separated lists), which makes classes a pain in the ass to check. The canonically correct way to look for elements having a specific class is:

//*[contains(concat(' ', normalize-space(@class), ' '), ' $className ')]

In your case this is

el = doc.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' channel ')]"
)
# print(el)
# [<Element div at 0x7fa44e31ccc8>, <Element div at 0x7fa44e31c278>, <Element div at 0x7fa44e31cdb8>]
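
Here is a small sketch of that class test with the class name padded by a space on each side, so that partial names such as `somechannel` do not match. The helper name divs_with_class, the XPath variable $padded and the SAMPLE markup are my own, for illustration only.

```python
import lxml.html

# Hypothetical sample markup, not taken from the real page.
SAMPLE = """
<html><body>
  <div class="branded-page channel">token</div>
  <div class="somechannel">partial name</div>
  <div class="channel-title-container">prefix only</div>
</body></html>
"""

doc = lxml.html.document_fromstring(SAMPLE)


def divs_with_class(root, class_name):
    """Return the <div> elements that have class_name as a whole class token."""
    query = (
        "//div[contains(concat(' ', normalize-space(@class), ' '), $padded)]"
    )
    # lxml binds keyword arguments to XPath variables, which avoids
    # pasting the class name into the expression string.
    return root.xpath(query, padded=f" {class_name} ")


padded_text = [d.text for d in divs_with_class(doc, "channel")]
print(padded_text)  # ['token']
```

Padding both the attribute value and the name being searched for turns the substring test into a whole-word test.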

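Or register your own XPath extension function, hasaclass(*classes):

from lxml import etree

def _hasaclass(context, *cls):
    # placeholder: should return True when the context node has one of the classes in cls
    return "your implementation ..."

# None registers the function without a namespace prefix
xpath_utils = etree.FunctionNamespace(None)
xpath_utils['hasaclass'] = _hasaclass

el = doc.xpath("//div[hasaclass('channel')]")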
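
One possible way to fill in that placeholder, as a sketch rather than the only implementation: split the class attribute of the context node on whitespace and check membership. The body of _hasaclass and the SAMPLE markup are my assumptions.

```python
import lxml.html
from lxml import etree


def _hasaclass(context, *cls):
    """True if the node being tested has at least one of the given classes."""
    # context.context_node is the element the predicate is evaluated on.
    tokens = (context.context_node.get("class") or "").split()
    return any(name in tokens for name in cls)


# Register globally, without a namespace prefix.
xpath_utils = etree.FunctionNamespace(None)
xpath_utils["hasaclass"] = _hasaclass

# Hypothetical sample markup, not taken from the real page.
SAMPLE = """
<html><body>
  <div class="branded-page channel">token</div>
  <div class="somechannel">partial name</div>
  <div>no class</div>
</body></html>
"""

doc = lxml.html.document_fromstring(SAMPLE)

found_text = [d.text for d in doc.xpath("//div[hasaclass('channel')]")]
print(found_text)  # ['token']
```

Because the function accepts several names, `hasaclass('channel', 'somechannel')` would match an element carrying either class.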
Andrei.Danciuc
  • The second arg in `contains()` should also have spaces added (like `' $className '` and `' channel '`). Otherwise you'll still match classes like `somechannel`. – Daniel Haley Apr 29 '19 at 17:43