I am parsing with BeautifulSoup using html.parser, and I am having trouble with the <link> tag: it is processed differently from other tags.
With the <title> tag, it works as expected:
>>> BeautifulSoup("<title>Somalia’s Electoral Crisis in Extremis</title>", features='html.parser')
<title>Somalia’s Electoral Crisis in Extremis</title>
However, when processing the <link> tag, it introduces a slash in the opening tag and drops the closing tag:
>>> BeautifulSoup("<link>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</link>", features='html.parser')
<link/>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/
Why is it doing this?
Now if I use the 'lxml' or 'xml' parsers instead, it works fine.
>>> BeautifulSoup("<link>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</link>", features='lxml')
<html><head><link/></head><body><p>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</p></body></html>
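For anyone who wants to reproduce the comparison, here is a minimal sketch that runs the same snippet through all three parsers. It assumes bs4 is installed; lxml is optional, and a parser that is missing is reported as None. The helper name parse_with is only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

URL = "https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/"
SNIPPET = f"<link>{URL}</link>"


def parse_with(features):
    """Return the serialized tree, or None if that parser is not installed."""
    try:
        return str(BeautifulSoup(SNIPPET, features=features))
    except FeatureNotFound:
        # Raised by bs4 when the requested parser (e.g. lxml) is unavailable.
        return None


# Parse the identical input with each parser so the outputs can be compared.
results = {name: parse_with(name) for name in ("html.parser", "lxml", "xml")}

for name, output in results.items():
    print(f"{name:12} -> {output!r}")
```

Only the parser name changes between runs, so any difference in the printed output comes from the parser rather than from the input.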
I am using html.parser because I also encounter nested elements (tags like <something:tag>) and CDATA strings. So parsing CDATA with lxml (which did not work for me) would also be a solution, if it is possible.
Am I going to have to write some logic to decide which library to parse each site with, or is there a way to do this with BeautifulSoup as is?