BeautifulSoup: how does findAll work?

Question

I've noticed some weird behavior in the findAll method:

>>> from bs4 import BeautifulSoup
>>> htmls="<html><body><p class=\"pagination-container\">slytherin</p><p class=\"pagination-container and something\">gryffindor</p></body></html>"
>>> soup=BeautifulSoup(htmls, "html.parser")
>>> for i in soup.findAll("p",{"class":"pagination-container"}):
    print(i.text)


slytherin
gryffindor
>>> for i in soup.findAll("p", {"class":"pag"}):
    print(i.text)


>>> for i in soup.findAll("p",{"class":"pagination-container"}):
    print(i.text)


slytherin
gryffindor
>>> for i in soup.findAll("p",{"class":"pagination"}):
    print(i.text)


>>> len(soup.findAll("p",{"class":"pagination-container"}))
2
>>> len(soup.findAll("p",{"class":"pagination-containe"}))
0
>>> len(soup.findAll("p",{"class":"pagination-contai"}))
0
>>> len(soup.findAll("p",{"class":"pagination-container and something"}))
1
>>> len(soup.findAll("p",{"class":"pagination-conta"}))
0

So when we search for pagination-container, it returns both the first and the second p tag. That made me think it does a partial match, something like if passed_string in class_attribute_value:. But when I shortened the string passed to findAll, it never found anything!

How is that possible?

score 5 · Accepted Answer · edited May 23 '17 at 12:08

First of all, class is a special multi-valued, space-delimited attribute and gets special handling.

When you write soup.findAll("p", {"class":"pag"}), BeautifulSoup searches for elements that have the class pag. It splits each element's class value on whitespace and checks whether pag is among the resulting items. An element with class="test pag" or class="pag" would match; an element with class="pagination-container" would not, because pag is not one of its classes.

Note that with soup.findAll("p", {"class": "pagination-container and something"}), BeautifulSoup matches an element whose complete class attribute value equals that string. No splitting is involved in this case: the whole class value is compared against the desired string.
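To make the two rules above concrete, here is a minimal pure-Python sketch. It is a simplified model of the behavior described in this answer, not BeautifulSoup's actual implementation, and class_matches is a hypothetical helper name used only for illustration:

```python
def class_matches(attr_value, wanted):
    """Simplified model of matching a plain-string class filter.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    if attr_value is None:
        # The element has no class attribute at all.
        return False
    # Rule 1: the filter equals one of the space-separated classes.
    if wanted in attr_value.split():
        return True
    # Rule 2: the filter equals the complete attribute value.
    return wanted == attr_value


# The two class values from the question.
values = ["pagination-container", "pagination-container and something"]

for wanted in ["pagination-container", "pag", "pagination-container and something"]:
    matched = [v for v in values if class_matches(v, wanted)]
    print(repr(wanted), "->", len(matched), "match(es)")
```

Under this model the three filters give 2, 0 and 1 matches, which agrees with the counts in the question. A substring test is never performed, which is why the shortened strings find nothing.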

To have a partial match on one of the classes, you can provide a regular expression or a function as a class filter value:

import re

soup.find_all("p", {"class": re.compile(r"pag")})  # contains pag
soup.find_all("p", {"class": re.compile(r"^pag")})  # starts with pag

soup.find_all("p", {"class": lambda class_: class_ and "pag" in class_})  # contains pag
soup.find_all("p", {"class": lambda class_: class_ and class_.startswith("pag")})  # starts with pag

There is much more to say, but you should also know that BeautifulSoup has CSS selector support (limited, but it covers most common use cases). You can write things like:

soup.select("p.pagination-container")  # one of the classes is "pagination-container"
soup.select("p[class='pagination-container']")  # match the COMPLETE class attribute value
soup.select("p[class^=pag]")  # COMPLETE class attribute value starts with pag
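The "COMPLETE" in the comments above matters: a per-class "starts with" check (like the re.compile(r"^pag") filter) and a whole-attribute "starts with" check (like [class^=pag]) can disagree when the class of interest is not listed first. A rough pure-Python sketch of the difference, using two hypothetical helper functions rather than any BeautifulSoup API:

```python
def any_class_starts_with(attr_value, prefix):
    # Models the per-class check: look at each space-separated class.
    return any(cls.startswith(prefix) for cls in attr_value.split())


def whole_value_starts_with(attr_value, prefix):
    # Models the attribute selector check: look at the raw attribute string.
    return attr_value.startswith(prefix)


# "active" comes first here, so only the per-class check succeeds.
for value in ["pagination-container", "active pagination-container"]:
    print(repr(value),
          any_class_starts_with(value, "pag"),
          whole_value_starts_with(value, "pag"))
```

So if the order of classes in the markup is not under your control, the per-class form is usually the safer choice.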

Handling class attribute values in BeautifulSoup is a common source of confusion and questions.
