I parse HTML with Python.

After parsing I search for some elements in the tree.

So far I have not found an easy-to-use way to find elements in the tree. XPath is available, but I would prefer something more familiar.

Is there a way to use selectors in Python with a syntax similar to jQuery/CSS selectors?

guettli
  • Does this answer your question? [jquery-like HTML parsing in Python?](https://stackoverflow.com/questions/3051295/jquery-like-html-parsing-in-python) – imbr Jun 17 '20 at 18:01

2 Answers

BeautifulSoup has built-in support for CSS selectors:

>>> from bs4 import BeautifulSoup
>>> from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2
>>> soup = BeautifulSoup(urlopen("https://google.com"), "html.parser")
>>> soup.select("input[name=q]")
[<input autocomplete="off" class="lst" maxlength="2048" name="q" size="57" style="color:#000;margin:0;padding:5px 8px 0 6px;vertical-align:top" title="Google Search" value=""/>]

There is also the cssselect package, which you can use in combination with lxml.

Note that CSS selector support in BeautifulSoup has certain limitations, and lxml + cssselect supports more CSS selectors. As the BeautifulSoup documentation puts it:

This is all a convenience for users who know the CSS selector syntax. You can do all this stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well use lxml directly: it’s a lot faster, and it supports more CSS selectors. But this lets you combine simple CSS selectors with the Beautiful Soup API.

alecxe
  • I get: `AttributeError: 'lxml.etree._Element' object has no attribute 'cssselect'` I use lxml version 3.3.3 – guettli Aug 26 '15 at 15:06
  • @guettli could you update to 3.4.4 and try again? Also, what code are you executing? – alecxe Aug 27 '15 at 00:23
  • This is a new question: See http://stackoverflow.com/questions/32264533/lxml-cssselect-attributeerror-lxml-etree-element-object-has-no-attribute – guettli Aug 28 '15 at 06:26
  • @guettli yeah, `lxml.html` has CSS selecting feature. If you are parsing html, you should use `lxml.html` and not `lxml.etree`. – alecxe Aug 31 '15 at 14:13
  • I don't know if this is new, but now I have to install `cssselect` through pip in order to follow this answer – Hubro Sep 16 '17 at 16:06
  • @Hubro yeah, `cssselect` was a part of `lxml` and now is a separately installed package - updated the answer accordingly. Thanks for the heads up! – alecxe Sep 16 '17 at 16:13

There is a library called pyquery: https://pypi.python.org/pypi/pyquery

Here is an example from the docs:

>>> from pyquery import PyQuery as pq
>>> d = pq("<option value='1'><option value='2'>")
>>> d('option[value="1"]')
[<option>]
guettli