I am now confused by something I thought I understood but which, it turns out, I have been taking for granted.
One frequently encounters this type of for loop:
from bs4 import BeautifulSoup as bs

mystring = 'some string'
soup = bs(mystring, 'html.parser')
for elem in soup.find_all():
    pass  # do something with elem
What I hadn't paid much attention to is what elem actually is, until I ran into a version of this simplified string:
mystring = 'opening text<p>text one<BR> text two.<br></p>\
<p align="right">text three<br/> text four.</p><p class="myclass">text five. </p>\
<p>text six <span style="some style">text seven</span></p>\
<p>text 8. <span style="some other style">text nine</span></p>closing text'
I'm no longer sure what I expected the output to be, but when I ran this code:
soup = bs(mystring, 'html.parser')  # re-parse the new string
counter = 1  # using 'normal' (1-based) counting for simplicity
for elem in soup.find_all():
    print('elem ', counter, elem)
    counter += 1
The output was:
elem 1 <p>text one<br/> text two.<br/></p>
elem 2 <br/>
elem 3 <br/>
elem 4 <p align="right">text three<br> text four.</br></p>
elem 5 <br> text four.</br>
elem 6 <p class="myclass">text five. </p>
elem 7 <p>text six <span style="some style">text seven</span></p>
elem 8 <span style="some style">text seven</span>
elem 9 <p>text 8. <span style="some other style">text nine</span></p>
elem 10 <span style="some other style">text nine</span>
So bs4 with html.parser found 10 elements in the string. Their selection and presentation seemed unintuitive to me (for example, opening text and closing text were skipped). Not only that, but the output of print(len(soup)) turned out to be 7!
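For what it's worth, here is a minimal sketch for poking at those two numbers. It assumes that len(soup) counts the direct children held in soup.contents (text nodes included) and that find_all() with no arguments returns only Tag objects at any depth; the variable names top_level, tags and loose_text are just illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Same markup as in the question, built from adjacent string literals.
mystring = (
    'opening text<p>text one<BR> text two.<br></p>'
    '<p align="right">text three<br/> text four.</p><p class="myclass">text five. </p>'
    '<p>text six <span style="some style">text seven</span></p>'
    '<p>text 8. <span style="some other style">text nine</span></p>closing text'
)

soup = BeautifulSoup(mystring, 'html.parser')

# Assumption: len(soup) is the number of direct children, text nodes included.
top_level = [type(child).__name__ for child in soup.contents]
print(len(soup), top_level)

# find_all() with no arguments walks the whole tree but yields tags only.
tags = soup.find_all()
print(len(tags), all(isinstance(t, Tag) for t in tags))

# Bare text sits in NavigableString nodes, which find_all() skips by default.
loose_text = [str(c) for c in soup.contents if isinstance(c, NavigableString)]
print(loose_text)
```

If that assumption holds, the 7 would be the five top-level p tags plus the two loose text nodes, while the 10 counts every tag at every depth and no text nodes at all.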
So, just to make sure, I swapped out html.parser for both lxml and html5lib. In both cases, not only did print(len(soup)) give 1, but the number of elems jumped to 13! And, naturally, the extra elements were different. From the 4th elem through the end, both libraries were identical to html.parser. For the first three, however...
With html5lib you get:
elem 1 <html><head></head><body>opening text<p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body></html>
elem 2 <head></head>
elem 3 <body>opening text<p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body>
With lxml, on the other hand, you get:
elem 1 <html><body><p>opening text</p><p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body></html>
elem 2 <body><p>opening text</p><p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body>
elem 3 <p>opening text</p>
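To compare the parsers side by side, something like the following sketch could be used. It runs on a shortened stand-in for the string above, the helper name summarize is hypothetical, and it assumes lxml and html5lib may not be installed, in which case that parser is simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened stand-in for the test string: loose text around a single <p>.
fragment = 'opening text<p>text one</p>closing text'


def summarize(markup, parser):
    """Hypothetical helper: top-level node count, tag count, first tag names."""
    soup = BeautifulSoup(markup, parser)
    tags = soup.find_all()
    return len(soup), len(tags), [t.name for t in tags[:3]]


results = {}
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        results[parser] = summarize(fragment, parser)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        continue

for parser, (top, count, first) in results.items():
    print(f'{parser}: len(soup)={top}, tags={count}, first tags={first}')
```

The point of the comparison is only to make visible which parsers wrap a fragment in html/body (and, for html5lib, head) and which leave it as given.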
So what is the philosophy behind all this? Whose 'fault' is it? Is there a 'right' or 'wrong' answer? And, practically speaking, should I just stick religiously to one parser, or is there a time and place for each?
Apologies for the length of the question.