
I have some simple code...

from bs4 import BeautifulSoup, SoupStrainer

text = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<div></div>
<div class='detail'></div>
<div></div>
<div class='detail'></div>
<div></div>"""

for div in BeautifulSoup(text, 'lxml', parse_only=SoupStrainer('div', attrs={'class': 'detail'})):
    print(div)

...which I expect to print the two divs with class 'detail'. Instead, I get the two divs and, for some reason, the doctype as well:

html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
<div class="detail"></div>
<div class="detail"></div>

What's happening here? How can I avoid matching the doctype?

EDIT

Here is one filtering method I found:

from bs4 import BeautifulSoup, SoupStrainer, Doctype
...
for div in BeautifulSoup(text, 'lxml', parse_only=SoupStrainer('div', attrs={'class': 'detail'})):
    if type(div) is Doctype:  # skip the doctype node that slips through the strainer
        continue
    print(div)

Still interested in knowing how to avoid the situation where I have to filter out the doctype while using SoupStrainer.
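
One variant that avoids the explicit type check (a sketch I haven't benchmarked) is to keep the SoupStrainer for parsing but iterate with find_all on the strained soup. find_all only ever returns Tag objects, so the doctype never shows up, and since the strained tree is tiny, the find_all pass should cost next to nothing:

from bs4 import BeautifulSoup, SoupStrainer

# Strain the parse down to the matching divs, then iterate with find_all
# on the (tiny) strained tree -- find_all yields only Tag objects.
strainer = SoupStrainer('div', attrs={'class': 'detail'})
soup = BeautifulSoup(text, 'lxml', parse_only=strainer)
for div in soup.find_all('div', attrs={'class': 'detail'}):
    print(div)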

The reason I want to use SoupStrainer instead of find_all is because SoupStrainer is almost two times faster, which adds up to a difference of ~30 seconds over just 1000 parsed pages:

def soup_strainer(text):
    [div for div in BeautifulSoup(text, 'lxml', parse_only=SoupStrainer('div', attrs={'class': 'detail'})) if type(div) is not Doctype]

def find_all(text):
    [div for div in BeautifulSoup(text, 'lxml').find_all('div', {'class': 'detail'})]

from timeit import timeit

print(timeit('soup_strainer(text)', number=1000, globals=globals()))  # 38.091634516923584
print(timeit('find_all(text)', number=1000, globals=globals()))  # 65.1686057066947
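
The hybrid from above could be timed the same way; the function name here is just for illustration, and I haven't measured it, so no number is given:

def soup_strainer_find_all(text):
    # parse with the strainer, then let find_all skip non-Tag nodes
    [div for div in BeautifulSoup(text, 'lxml', parse_only=SoupStrainer('div', attrs={'class': 'detail'})).find_all('div', attrs={'class': 'detail'})]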
Alex Undefined
  • check [this](https://stackoverflow.com/questions/17943992/beautifulsoup-and-soupstrainer-for-getting-links-dont-work-with-hasattr-returni) post and [this](https://stackoverflow.com/questions/17988884/lxml-incorrectly-parsing-the-doctype-while-looking-for-links) post - I think they will resolve your issue and answer your question... – coder Sep 15 '17 at 13:57
  • if you want to filter out `!DOCTYPE` also, use `find_all` instead of `SoupStrainer`... – coder Sep 15 '17 at 13:58

1 Answer


I don't think you need to use SoupStrainer for this task. Instead, the built-in findAll method should do what you want. Here is the code I tested, which seems to work fine:

[div for div in BeautifulSoup(text, 'lxml').findAll('div', {'class': 'detail'})]

This will create a list of the divs you are looking for, excluding the DOCTYPE.
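
Note that findAll is the legacy BeautifulSoup 3 camelCase spelling, kept in bs4 as an alias; the documented bs4 name is find_all, so the equivalent call is:

[div for div in BeautifulSoup(text, 'lxml').find_all('div', {'class': 'detail'})]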

Hope this helps.

sadmicrowave
  • `findAll` is quite slow compared to `SoupStrainer`. – Alex Undefined Sep 15 '17 at 23:46
  • @AlexUndefined I agree; however, `SoupStrainer` is not going to filter out `doctype` or `text` elements; that's just how it works. See [this post](https://stackoverflow.com/questions/17943992/beautifulsoup-and-soupstrainer-for-getting-links-dont-work-with-hasattr-returni) for more evidence. – sadmicrowave Sep 18 '17 at 12:59