I have some simple code...
from bs4 import BeautifulSoup, SoupStrainer
text = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<div></div>
<div class='detail'></div>
<div></div>
<div class='detail'></div>
<div></div>"""
for div in BeautifulSoup(text, 'lxml', parse_only = SoupStrainer('div', attrs = { 'class': 'detail' })):
    print(div)
...which I expect to print two divs with class 'detail'. Instead, I get the two divs and the doctype, for some reason:
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
<div class="detail"></div>
<div class="detail"></div>
What's happening here? How can I avoid matching the doctype?
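A quick type check (printing each node's class name while iterating) confirms that the extra item is a Doctype node rather than a tag:

from bs4 import BeautifulSoup, SoupStrainer

soup = BeautifulSoup(text, 'lxml', parse_only = SoupStrainer('div', attrs = { 'class': 'detail' }))
for node in soup:
    print(type(node).__name__)  # Doctype, Tag, Tag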
EDIT
Here is one filtering method I found:
from bs4 import BeautifulSoup, SoupStrainer, Doctype
...
for div in BeautifulSoup(text, 'lxml', parse_only = SoupStrainer('div', attrs = { 'class': 'detail' })):
    if type(div) is Doctype:
        continue
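An equivalent filter (just a sketch, reusing the text variable from above) keeps only Tag nodes via isinstance, which skips the doctype and any other non-tag nodes in a single check:

from bs4 import BeautifulSoup, SoupStrainer, Tag

for div in BeautifulSoup(text, 'lxml', parse_only = SoupStrainer('div', attrs = { 'class': 'detail' })):
    if not isinstance(div, Tag):
        continue
    print(div)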
I'm still interested in knowing how to avoid the situation where I have to filter out the doctype while using SoupStrainer.
The reason I want to use SoupStrainer instead of find_all is because SoupStrainer is almost twice as fast, which adds up to a ~30-second difference over just 1000 parsed pages:
def soup_strainer(text):
    [div for div in BeautifulSoup(text, 'lxml', parse_only = SoupStrainer('div', attrs = { 'class': 'detail' })) if type(div) is not Doctype]

def find_all(text):
    [div for div in BeautifulSoup(text, 'lxml').find_all('div', { 'class': 'detail' })]
from timeit import timeit
print( timeit('soup_strainer(text)', number = 1000, globals = globals()) ) # 38.091634516923584
print( timeit('find_all(text)', number = 1000, globals = globals()) ) # 65.1686057066947
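One variant I haven't timed (a sketch only): call find_all on the already-strained soup. Since find_all('div') matches tags by name, the doctype drops out without an explicit type check:

def soup_strainer_find_all(text):
    strainer = SoupStrainer('div', attrs = { 'class': 'detail' })
    return BeautifulSoup(text, 'lxml', parse_only = strainer).find_all('div')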