I'm generating some HTML with Python and BeautifulSoup4. At the end, I'd like to prettify the generated HTML. If I prettify as follows:
soup.prettify()
BeautifulSoup converts all the &nbsp; entities to spaces. Unfortunately, my webpage relies on having these &nbsp; entities. After some guidance, I realized that this can be overcome by supplying a formatter to prettify:
soup.prettify(formatter='html')
Unfortunately, when I do this, the &nbsp; entities are preserved, but BeautifulSoup also encodes the Cyrillic (Russian) characters in my HTML as entities, making them unreadable to me. That puts the formatter='html' option off limits.
(formatter='minimal' and formatter=None don't work either; they leave the Cyrillic alone but drop the &nbsp;.)
After looking at the BeautifulSoup docs, I realized you can specify your own custom formatter using BeautifulSoup's Formatter class. Unfortunately, I am unsure how this class works, and I have not been able to find documentation for it. Does anyone know whether it's possible to write a custom formatter that tells BeautifulSoup to preserve &nbsp; (and leave my Cyrillic characters alone)? Or is there any documentation for how this class works exactly? There are some examples in that section of the BS documentation, but after reading them I am still unclear on how to accomplish this.
EDIT: I have found different documentation, which makes it much clearer. The custom formatter is just a function you pass to the formatter argument (i.e. prettify(formatter=my_func), where my_func is a function you define yourself). It gets called once for every string and attribute value encountered, and whatever the function returns is used as the output of prettify. I have experimented with writing my own formatter function, and I'm able to detect whether an &nbsp; is there, but I'm unsure what to return from the function so that prettify will output the &nbsp;. See 'Example 3' below for my dummy formatter that detects &nbsp;.
Here is a dummy example demonstrating the problem:
EXAMPLE 1: Using prettify without a formatter
from bs4 import BeautifulSoup
hello = '<span>Привет,&nbsp;мир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("\nBefore prettify:\n{}".format(soup))
soup = soup.prettify()
print("\nAfter prettify:\n{}".format(soup))
Output: Cyrillic characters are fine, but the &nbsp; is converted to whitespace
Before prettify:
<span>Привет, мир</span>
After prettify:
<span>
Привет, мир
</span>
EXAMPLE 2: Using prettify with formatter='html'
from bs4 import BeautifulSoup
hello = '<span>Привет,&nbsp;мир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("\nBefore prettify:\n{}".format(soup))
soup = soup.prettify(formatter='html')
print("\nAfter prettify:\n{}".format(soup))
Output: the &nbsp; is preserved, but the Cyrillic characters are converted to unreadable entities
Before prettify:
<span>Привет, мир</span>
After prettify:
<span>
&Pcy;&rcy;&icy;&vcy;&iecy;&tcy;,&nbsp;&mcy;&icy;&rcy;
</span>
EXAMPLE 3: Supplying a custom formatter. This is just a dummy formatter for the sake of the example, to detect whether an &nbsp; is present. What should I return from this function if I want the &nbsp; to be preserved? (P.S. It seems &nbsp; is parsed as \xa0, which is why I'm checking for it this way.)
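from bs4 import BeautifulSoup

def check_for_nbsp(text):
    # After parsing, &nbsp; is stored as the non-breaking space character U+00A0
    if '\xa0' in text:
        return text + " <-- HAS"
    else:
        return text + " <-- DOESN'T HAVE"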
hello = '<span>Привет,&nbsp;мир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("\nBefore prettify:\n{}".format(soup))
soup = soup.prettify(formatter=check_for_nbsp)
print("\nAfter prettify:\n{}".format(soup))
Output:
Before prettify:
<span>Привет, мир</span>
After prettify:
<span>
Привет, мир <-- HAS
</span>
Is there a way to get the best of both worlds: preserve the &nbsp; AND the Cyrillic characters? Alternatively, is there a reliable Python package other than BeautifulSoup that prettifies HTML?
Here is a previous Stack Overflow question I posted about the mangled Cyrillic characters. That is what led me to remove the formatter='html' option; unfortunately, doing so removes the &nbsp; characters, which is equally problematic.
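For reference, here is a minimal sketch of the kind of custom formatter function being asked about. It rests on an assumption not confirmed above: that prettify writes the function's return value verbatim, so the function has to do its own escaping of &, < and >. The names keep_nbsp, prettify_keeping_nbsp and NBSP are made up for this illustration and are not part of BeautifulSoup.

```python
import html

# Hypothetical constant: the parser stores a decoded &nbsp; as U+00A0.
NBSP = "\xa0"


def keep_nbsp(text):
    """Escape &, < and >, then write non-breaking spaces back as &nbsp;.

    Cyrillic (and any other non-ASCII text) is returned untouched.
    """
    # Escape first; replacing first would turn "&nbsp;" into "&amp;nbsp;".
    escaped = html.escape(text, quote=False)
    return escaped.replace(NBSP, "&nbsp;")


def prettify_keeping_nbsp(markup):
    """Parse markup and prettify it with keep_nbsp as the formatter.

    Assumes bs4 is installed and that prettify emits the formatter's
    return value as-is.
    """
    # Imported lazily so that keep_nbsp itself only needs the standard library.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(markup, "html.parser")
    return soup.prettify(formatter=keep_nbsp)


if __name__ == "__main__":
    print(keep_nbsp("Привет,\xa0мир"))  # prints: Привет,&nbsp;мир
```

If the assumption holds, calling prettify_keeping_nbsp on the markup from the examples above should give output that contains both the readable Cyrillic and the &nbsp; entity. Escaping before the replacement is a deliberate choice here, since a formatter that skips escaping would let a bare & or < from the text through into the HTML.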