I have HTML with Cyrillic characters, and I am using BeautifulSoup4 to process it. It works great, but when I call prettify, it converts all the Cyrillic characters to something else. Here is a dummy example using Python 3:

from bs4 import BeautifulSoup

hello = '<span>Привет, мир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("Before prettify:\n{}".format(soup))
# prettify() returns a str, so keep it separate from the soup object
pretty = soup.prettify(formatter='html')
print("\nafter prettify:\n{}".format(pretty))

Here is the output it generates:

Before prettify:
<span>Привет, мир</span>

after prettify:
<span>
 &Pcy;&rcy;&icy;&vcy;&iecy;&tcy;, &mcy;&icy;&rcy;
</span>

It formats the HTML properly (putting each tag on its own line), but it converts the Cyrillic characters to something else (I'm not even certain what encoding that is, to be honest).
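For reference, the &Pcy;-style sequences are not a character encoding. They are HTML named character references (entities), one per Cyrillic letter. A small sketch using only the standard library shows that they decode back to the original text; the variable names here are just for illustration:

```python
import html

# The text node exactly as prettify(formatter='html') printed it above
entities = "&Pcy;&rcy;&icy;&vcy;&iecy;&tcy;, &mcy;&icy;&rcy;"

# html.unescape() resolves HTML5 named character references to Unicode
decoded = html.unescape(entities)
print(decoded)  # Привет, мир
```

A browser performs the same decoding, so the entity form renders identically; the difference only matters when reading or diffing the HTML source.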

I have tried various things to prevent this: prettify(encoding=None, formatter='html') and prettify(encoding='utf-8', formatter='html'). I have also tried changing the way I create the soup object, with soup = BeautifulSoup(hello.encode('utf-8'), 'html.parser') and soup = BeautifulSoup(hello, 'html.parser', from_encoding='utf-8'). Nothing seems to change what happens to the Cyrillic characters during prettify.

I figure this must be a very simple mistake I am making with encoding parameters somewhere, but after searching the internet and BS4 documentation, I am unable to figure this out. Is there a way to use BeautifulSoup's prettify, but maintain the Cyrillic characters as they were originally, or is this not possible?

EDIT: I have realized now (thanks to DYZ's answer) that removing formatter='html' from the call to prettify stops BeautifulSoup from converting the Cyrillic characters. Unfortunately, this also replaces any &nbsp; entities in the document with literal non-breaking space characters. After looking at BS4's output-formatters documentation, it seems the solution is to create a custom formatter using BS4's Formatter class and pass it to prettify: soup.prettify(formatter=my_formatter). I'm not sure yet what that would entail, though. I have posted a separate Stack Overflow question to solve that problem ("format prettify to both preserve &nbsp; and Cyrillic characters"). EDIT: See the answer to that question; I finally figured it out.
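A minimal sketch of what such a custom formatter could look like, assuming a bs4 version that ships bs4.formatter.HTMLFormatter (4.8 or later). The helper name keep_nbsp, the variable names, and the sample markup are made up for illustration, and this is not necessarily the same code as the answer to the linked question:

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution
from bs4.formatter import HTMLFormatter


def keep_nbsp(text):
    # Hypothetical helper: escape only &, < and >, which is what the
    # default "minimal" formatter does, so Cyrillic text is left alone ...
    escaped = EntitySubstitution.substitute_xml(text)
    # ... then write U+00A0 back out as the &nbsp; entity.
    return escaped.replace("\u00a0", "&nbsp;")


nbsp_formatter = HTMLFormatter(entity_substitution=keep_nbsp)

sample = "<span>Привет,&nbsp;мир</span>"
soup = BeautifulSoup(sample, "html.parser")
pretty = soup.prettify(formatter=nbsp_formatter)
print(pretty)
```

The same function can probably be extended with further replace() calls if other entities need to survive the round trip.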

bikz

1 Answer

From the documentation:

If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible.

If this is not desirable, do not use the HTML formatter:

soup.prettify()
#'<span>\n Привет, мир\n</span>'
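To see the trade-off between the two built-in formatters side by side, here is a short sketch. The sample markup with an &nbsp; in it is an assumption added for illustration; it is not from the original question:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<span>Привет,&nbsp;мир</span>", "html.parser")

# Default formatter ("minimal"): only &, < and > are escaped, so the
# Cyrillic text is kept, but &nbsp; comes out as a literal U+00A0.
default_out = soup.prettify()
print(repr(default_out))

# formatter="html": characters that have a named HTML entity are written
# as that entity, so &nbsp; is preserved as an entity.
html_out = soup.prettify(formatter="html")
print(repr(html_out))
```

Printing with repr() makes the literal \xa0 visible, which is otherwise easy to mistake for an ordinary space.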
DYZ
  • Ah yes, I recall now why I was adding formatter="html" in the call to prettify. If I don't add it, prettify converts my &nbsp; entities to literal whitespace, which unfortunately messes up the way the HTML displays. – bikz Oct 30 '21 at 23:54
  • In case anyone is curious, I have solved the problem mentioned above: you can supply a custom formatter to prettify which will preserve both the Cyrillic and the &nbsp;. See the answer I posted to my other Stack Overflow question here: https://stackoverflow.com/questions/69790205/prettify-with-beautifulsoup-using-a-formatter-that-will-preserve-nbsp-and-cyril/69790637#69790637 – bikz Oct 31 '21 at 21:11