I am trying to parse with BeautifulSoup's html.parser, and I am having trouble with the <link> tag: it is being processed differently from other tags.

On the <title> tag, it works as expected:

>>> BeautifulSoup("<title>Somalia’s Electoral Crisis in Extremis</title>", features='html.parser')
<title>Somalia’s Electoral Crisis in Extremis</title>

However when processing the <link> tag, it introduces a slash in the opening tag and drops the closing tag:

>>> BeautifulSoup("<link>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</link>", features='html.parser')
<link/>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/

Why is it doing this?
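
One way to narrow this down is to look at what the standard-library html.parser reports on its own, without BeautifulSoup. The sketch below uses a hypothetical `EventLogger` subclass and `log_events` helper (neither is part of BeautifulSoup or the standard library) to record the raw parser events. If the start tag, the text, and the end tag all show up, then the rewrite to `<link/>` presumably comes from the HTML rules BeautifulSoup applies on top, under which `<link>` is a void element with no end tag.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records the raw events html.parser emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(markup):
    """Feed markup to the stdlib parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


LINK_MARKUP = (
    "<link>https://warontherocks.com/2021/04/"
    "somalias-electoral-crisis-in-extremis/</link>"
)

events = log_events(LINK_MARKUP)
for event in events:
    print(event)
```

This only inspects the tokenizer level; it does not change how BeautifulSoup builds its tree.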

Now if I use the 'lxml' or 'xml' parsers, it works fine.

>>> BeautifulSoup("<link>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</link>", features='lxml')
<html><head><link/></head><body><p>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</p></body></html>

I am using html.parser because I also encounter namespaced elements (tags like <something:tag>) and CDATA strings. If it is possible to parse CDATA with lxml (which did not work for me), that would also be a solution.

Am I going to have to write some logic to decide which library to parse each site with, or is there a way to do this with BeautifulSoup as is?

Stonecraft
  • lxml is faster and more forgiving, so I generally use that. There is an existing post comparing the various parsers, and the documentation compares them as well. In HTML, <link> should have no end tag: https://html.spec.whatwg.org/#the-link-element. I assume lxml is attempting a repair on this HTML? – QHarr Apr 03 '21 at 01:19
  • Thanks, but is there a way to handle CDATA in `lxml`? That is why I am not using it. – Stonecraft Apr 03 '21 at 01:19
  • https://stackoverflow.com/questions/13694143/parsing-cdata-in-xml-with-python. <![CDATA[ ]]> is an instruction that the content should not be interpreted as XML. There are probably more answers on how to work with CDATA. I will have a quick look. This is just the one that came to mind. I note you wanted to use lxml. – QHarr Apr 03 '21 at 01:21
  • Or is there a way to specify custom tags for cases where I encounter ones that are malformed in a specific way? – Stonecraft Apr 03 '21 at 01:21
  • https://stackoverflow.com/questions/37661822/python-lxml-modify-cdata, https://stackoverflow.com/questions/25813756/lxml-kills-my-cdata-sections ... and possibly from these: https://stackoverflow.com/search?q=lxml+cdata – QHarr Apr 03 '21 at 01:23
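
The CDATA point raised in the comments can be sketched with the standard library alone. This is a minimal example, assuming the input is well-formed XML such as an RSS item; real feeds may not be. The feed fragment, the `dc:creator` element and its value, and the `parse_item` helper are all made up for illustration. An XML parser treats `<link>` as an ordinary element, and `xml.etree.ElementTree` returns CDATA content as plain text.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment: the dc:creator element and its value are
# invented; the title and link come from the question.
FEED_ITEM = """\
<item xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title><![CDATA[Somalia’s Electoral Crisis in Extremis]]></title>
  <link>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</link>
  <dc:creator><![CDATA[Example Author]]></dc:creator>
</item>
"""

# Prefix-to-URI map used to look up namespaced tags such as <dc:creator>.
NAMESPACES = {"dc": "http://purl.org/dc/elements/1.1/"}


def parse_item(xml_text):
    """Extract title, link, and creator text from one XML item."""
    item = ET.fromstring(xml_text)
    return {
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "creator": item.findtext("dc:creator", namespaces=NAMESPACES),
    }


parsed = parse_item(FEED_ITEM)
for key, value in parsed.items():
    print(f"{key}: {value}")
```

This does not help with malformed markup, since `ET.fromstring` raises `ParseError` on input that is not well-formed XML.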
