Python RegEx with Beautifulsoup 4 not working

Question

I want to find all div tags which have a certain pattern in their class name but my code is not working as desired.

This is the code snippet

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

all_findings = soup.findAll('div',attrs={'class':re.compile(r'common text .*')})

where html_doc is the string with the following html

<div class="common text sighting_4619012">

  <div class="hide-c">
    <div class="icon location"></div>
    <p class="reason"></p>
    <p class="small">These will not appear</p>
    <span class="button secondary ">wait</span>
  </div>

  <div class="show-c">
  </div>

</div>

But all_findings is coming out as an empty list while it should have found one item.

It does work when the pattern matches a single class name:

all_findings = soup.findAll('div',attrs={'class':re.compile(r'hide-c')})

I am using bs4.

Have a look at [this SO post](http://stackoverflow.com/questions/13794532/python-regular-expression-for-beautiful-soup). Is it helpful? If it answers your question, yours is a duplicate. So, bs4 sees `common text sighting_4619012` as an array of `common` `text` `sighting_4619012`. Regex is applied to each of them separately. — Wiktor Stribiżew, Aug 13 '15 at 19:07
Ah, I think I got what you meant now. That post you mentioned didn't mention `Regex is applied to each of them separately` so I couldn't make this out. But what if we want to find the match( using regex) according to 2 items of the list. Ex - `"text", "sighting_4619012"` Double Ah, I think I got my doubt cleared with @alecxe's answer. — Shivendra, Aug 13 '15 at 19:36
[*HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is `class` (that is, a tag can have more than one CSS class). Others include `rel`, `rev`, `accept-charset`, `headers`, and `accesskey`. Beautiful Soup presents the value(s) of a multi-valued attribute as a **list***](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#multivalue). — Wiktor Stribiżew, Aug 13 '15 at 19:51
Copying my comment here too - I faced one problem today though, it is matching classes with only "common" as value. How to make such that each of the matches are satisfied? — Shivendra, Aug 14 '15 at 08:29
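The point made in these comments can be checked directly. The following is a minimal sketch, assuming the `html.parser` backend used in the question and a shortened copy of the question's markup; the variable names `outer`, `class_tokens` and `by_token` are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Shortened version of the markup from the question.
html_doc = '<div class="common text sighting_4619012"><div class="hide-c"></div></div>'
soup = BeautifulSoup(html_doc, 'html.parser')

outer = soup.find('div')

# With an HTML parser, the class attribute is split on whitespace into a list.
class_tokens = outer['class']
print(class_tokens)

# A pattern written for one token matches, because each token is tested on its own.
by_token = soup.find_all('div', class_=re.compile(r'^sighting_\d+$'))
print(len(by_token))
```

Writing the pattern against a single class name, rather than against the whole attribute string, is what makes the regex approach behave predictably here.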

score 2 · Answer 1 · answered Aug 13 '15 at 18:34

Instead of using a regular expression, put the classes you are looking for in a list:

all_findings = soup.findAll('div',attrs={'class':['common', 'text']})

Example code:

from bs4 import BeautifulSoup

html_doc = """<div class="common text sighting_4619012">

  <div class="hide-c">
    <div class="icon location"></div>
    <p class="reason"></p>
    <p class="small">These will not appear</p>
    <span class="button secondary ">wait</span>
  </div>

  <div class="show-c">
  </div>

</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div',attrs={'class':['common', 'text']})
print(all_findings)

This outputs:

[<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>]

Umm, this might work here, but later on I am sure in near future I'll have to deal with scenarios which can be easily dealt with RegEx only. `Ex- "common text sighting_46....." , to find all tags starting with 46 in the example and followed by, lets say 5 numbers.` — Shivendra, Aug 13 '15 at 18:39

alecxe · Accepted Answer · 2015-08-16T02:54:55.870


To extend @Andy's answer, you can make a list of class names and compiled regular expressions:

soup.find_all('div', {'class': ["common", "text", re.compile(r'sighting_\d{5}')]})

Note that this returns the div elements that have any one of the specified classes or patterns. In other words, the condition is common or text or sighting_ followed by five digits, not all three together.
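The "or" behaviour is easy to see on a small document. This is only an illustrative sketch: the three-div markup and the names `filters` and `matched_texts` are made up for the demonstration and are not part of the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: one div with a single class, one with all three, one with neither.
html_doc = """
<div class="common">only common</div>
<div class="common text sighting_12345">all three</div>
<div class="other">neither</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# A list of filters matches a tag if any single entry matches any one of its classes.
filters = ["common", "text", re.compile(r'sighting_\d{5}')]
matches = soup.find_all('div', {'class': filters})

matched_texts = [div.get_text() for div in matches]
print(matched_texts)
```

The div that carries only `common` is returned alongside the one that carries all three classes, which is the behaviour reported in the comments.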

If you want to have them joined with "and", one option would be to turn off the special treatment for "class" attributes by having the document parsed as "xml":

# The 'xml' parser requires lxml to be installed.
soup = BeautifulSoup(html_doc, 'xml')
all_findings = soup.find_all('div', class_=re.compile(r'common text sighting_\d{5}'))
print(all_findings)
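If switching parsers is not an option, a CSS selector is another way to get "and" semantics. This is a different technique from the one in this answer, offered only as a sketch: it assumes a Beautiful Soup version whose `select()` supports attribute substring selectors, and the two-div markup is invented for the example. CSS cannot express "exactly five digits", so the selector below only checks that the substring `sighting_` is present.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the second div has every required class.
html_doc = """
<div class="common">only common</div>
<div class="common text sighting_4619012">all three</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# div.common.text requires both classes on the same element;
# [class*="sighting_"] additionally requires that substring in the class attribute.
selected = soup.select('div.common.text[class*="sighting_"]')
print(selected)
```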

edited Aug 16 '15 at 02:54

answered Aug 13 '15 at 19:33

alecxe


Now I know that I was essentially looking for applying RegEx to each value of an attribute of a tag. And with the exact match examples I couldn't figure out how to go about with the multiple regex case . – Shivendra Aug 13 '15 at 19:51
I ran into one problem today, though: it also matches elements whose class is only `"common"`. How can I require that every one of the patterns is satisfied? – Shivendra Aug 14 '15 at 08:27
@Shivendra I've updated the answer with an option. The problem is that, if you are parsing it as HTML, you'll have `class` treated as a multi-valued attribute.. – alecxe Aug 16 '15 at 02:55
That can be done, but I used a not so elegant workaround. In a `for loop` I check for each of RegEx separately and the elements which passed all the checks were selected. Initially I short-listed candidates to check further by the one-liner you had given earlier `soup.find_all('div', {'class': ["common", "text", re.compile(r'sighting_\d{5}')]})`. ;) – Shivendra Aug 16 '15 at 09:32
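The workaround described in the last comment, checking every pattern and keeping only the elements that pass all of them, can also be written as a filter function passed to `find_all`. This is a sketch of that idea rather than the commenter's actual code; `REQUIRED` and `has_all_classes` are hypothetical names, and the markup is invented for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: only the second div satisfies every pattern.
html_doc = """
<div class="common">only common</div>
<div class="common text sighting_4619012">all three</div>
<div class="text sighting_4619012">missing common</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Every pattern must be satisfied by at least one class token.
REQUIRED = [
    re.compile(r'^common$'),
    re.compile(r'^text$'),
    re.compile(r'^sighting_\d{5}'),
]

def has_all_classes(tag):
    """Return True for a div whose class list satisfies every pattern in REQUIRED."""
    if tag.name != 'div':
        return False
    tokens = tag.get('class', [])
    return all(any(p.search(t) for t in tokens) for p in REQUIRED)

all_findings = soup.find_all(has_all_classes)
print(all_findings)
```

Because the function sees the whole tag, it keeps the HTML parser's list-valued `class` and still enforces "and" across the patterns in a single pass.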
