69

Is there a Python library that lets me parse an HTML document the way jQuery does?

That is, I'd like to use CSS selector syntax to grab an arbitrary set of nodes from the document, read their content and attributes, and so on.

The only Python HTML parsing library I've used before is BeautifulSoup, and even though it's fine, I keep thinking my parsing would go faster if I had jQuery syntax available. :D

imbr
Roy Tang
  • The latest [BeautifulSoup has support for CSS selectors](https://stackoverflow.com/a/62435195/1207193) now – imbr Jun 17 '20 at 17:50

4 Answers

64

If you are fluent with BeautifulSoup, you could just add soupselect to your libs.
Soupselect is a CSS selector extension for BeautifulSoup.

Usage:

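from bs4 import BeautifulSoup as Soup
from soupselect import select
from urllib.request import urlopen  # Python 3; on Python 2 use urllib.urlopen

# Fetch the page and parse it with the standard-library HTML parser
soup = Soup(urlopen('http://slashdot.org/'), 'html.parser')
# Return every <h3> nested inside a <div class="title">
select(soup, 'div.title h3')
    [<h3><span><a href='//science.slashdot.org/'>Science</a>:</span></h3>,
     <h3><a href='//slashdot.org/articles/07/02/28/0120220.shtml'>Star Trek</h3>,
    ..]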
Jason
systempuntoout
  • This sounds like the best solution for me right now, I'll give it a try. Thanks! – Roy Tang Jun 16 '10 at 07:42
  • 6
    It's now `from bs4` for Beautiful Soup 4 – Flash Jun 29 '13 at 01:15
  • 10
    In case you have problems installing soupselect, you should try the pip-compatible version offered here: https://github.com/syabro/soupselect : `sudo pip install https://github.com/syabro/soupselect/archive/master.zip` – AsTeR Jan 22 '14 at 17:59
  • 5
    Just to mention, Beautiful Soup 4 already incorporates the soupselect project having built-in support for CSS selectors. See the [release note](http://www.crummy.com/2012/03/14/0). – nn0p Aug 24 '15 at 01:56
50

Consider PyQuery:

http://packages.python.org/pyquery/

>>> from pyquery import PyQuery as pq
>>> from lxml import etree
>>> from urllib.request import urlopen  # Python 3; on Python 2 use urllib.urlopen
>>> # A PyQuery object can be built from a string, an lxml tree, a URL or a file
>>> d = pq("<html></html>")
>>> d = pq(etree.fromstring("<html></html>"))
>>> d = pq(url='http://google.com/')
>>> d = pq(url='http://google.com/', opener=lambda url, **kw: urlopen(url).read())
>>> d = pq(filename=path_to_html_file)
>>> # The lookups below assume the loaded document contains
>>> # <p id="hello" class="hello">Hello world !</p>
>>> d("#hello")
[<p#hello.hello>]
>>> p = d("#hello")
>>> p.html()
'Hello world !'
>>> p.html("you know <a href='http://python.org/'>Python</a> rocks")
[<p#hello.hello>]
>>> p.html()
'you know <a href="http://python.org/">Python</a> rocks'
>>> p.text()
'you know Python rocks'
Luke Stanley
14

The lxml library supports CSS selectors.

zanetu
Ignacio Vazquez-Abrams
9

BeautifulSoup now has built-in support for CSS selectors:

import requests
from bs4 import BeautifulSoup as Soup

# Download the page and parse it; naming the parser avoids a bs4 warning
html = requests.get('https://stackoverflow.com/questions/3051295').content
soup = Soup(html, 'html.parser')

Title of this question

soup.select('h1.grid--cell :first-child')[0].text

Number of question upvotes

# select_one() returns only the first matching element
soup.select_one('[itemprop="upvoteCount"]').text

This uses Python Requests to fetch the HTML page.
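
The selectors in this answer depend on Stack Overflow's live markup, which may change. As a rough offline sketch of the same `select` / `select_one` calls, here is a version that runs against an invented HTML string (the markup and variable names are assumptions for illustration only):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<div class="title"><h3><a href="/science">Science</a></h3></div>
<div class="title"><h3><a href="/tech">Tech</a></h3></div>
<div class="other"><h3><a href="/misc">Misc</a></h3></div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every element matching the CSS selector.
links = soup.select("div.title h3 a")
texts = [a.get_text() for a in links]   # node content
hrefs = [a["href"] for a in links]      # node attributes

# select_one() returns the first match, or None when nothing matches.
first = soup.select_one("div.title a")
missing = soup.select_one("div.nope")

print(texts, hrefs, first["href"], missing)
```

Since `select_one` returns `None` on a miss, it is worth checking the result before reading `.text` or an attribute when scraping pages you don't control.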

Community
imbr