Prettify with BeautifulSoup using a formatter that will preserve &nbsp AND Cyrillic characters?

Question

I'm generating some HTML with python and BeautifulSoup4. At the end, I'd like to prettify the generated HTML. If I prettify as follows:

soup.prettify()

BeautifulSoup converts all the &nbsp characters to spaces. Unfortunately, my webpage relies on having these &nbsp characters. After some guidance, I realized that this can be overcome by supplying a formatter to prettify:

soup.prettify(formatter='html')

Unfortunately, when I do this, though the &nbsp characters are preserved, BeautifulSoup encodes the Cyrillic (Russian) characters in my HTML, making them unreadable to me. This leaves the formatter='html' option off limits to me.

(formatter='minimal' and formatter=None also don't work; they leave Cyrillic alone, but take away the &nbsp.)

After looking at the BeautifulSoup docs, I realized you can specify your own custom formatter using BeautifulSoup's Formatter class. Unfortunately, I am unsure how this class works. I have tried to find documentation for the Formatter class, but have been unable to. Does anyone know if it's possible to write a custom formatter that will tell BeautifulSoup to preserve &nbsp characters (and leave my Cyrillic characters alone)? Or is there any documentation on how exactly this class works? There are some examples in that section of the BS documentation, but after reading them, I am still unclear on how to accomplish what I'm trying to do.

EDIT: I have found different documentation, which makes it much clearer. The custom formatter is just a function you pass to the 'formatter' arg (i.e. prettify(formatter=my_func), where my_func is a function you define on your own). It gets called once for every string and attribute value encountered, receives that value as its argument, and whatever it returns is used as the output in prettify. I have experimented with writing my own formatter function, and I'm able to detect whether an &nbsp is there, but I am unsure what to return from the function so that prettify will output the &nbsp. See 'Example 3' below for my dummy formatter that detects &nbsp.

Here is a dummy example demonstrating the problem:

EXAMPLE 1: Using prettify without a formatter

from bs4 import BeautifulSoup
hello = '<span>Привет,&nbspмир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("\nBefore prettify:\n{}".format(soup))
soup = soup.prettify()
print("\nAfter prettify:\n{}".format(soup))

Output: Cyrillic characters are fine, but the &nbsp are converted to whitespace

Before prettify:
<span>Привет, мир</span>

After prettify:
<span>
 Привет, мир
</span>

EXAMPLE 2: Using prettify with formatter='html'

from bs4 import BeautifulSoup
hello = '<span>Привет,&nbspмир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("\nBefore prettify:\n{}".format(soup))
soup = soup.prettify(formatter='html')
print("\nAfter prettify:\n{}".format(soup))

Output: the &nbsp are preserved, but the Cyrillic characters are converted to unreadable HTML entities

Before prettify:
<span>Привет, мир</span>

After prettify:
<span>
 &Pcy;&rcy;&icy;&vcy;&iecy;&tcy;,&nbsp;&mcy;&icy;&rcy;
</span>

EXAMPLE 3: Supplying a custom formatter. This is just a dummy formatter for the sake of the example, to detect whether an &nbsp is there. What should I return from this function if I want the &nbsp to be preserved? (P.S. It seems &nbsp are parsed as \xa0, which is why I'm checking for it this way.)

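from bs4 import BeautifulSoup

def check_for_nbsp(text):
    # prettify passes each string and attribute value here; &nbsp arrives as '\xa0'
    if '\xa0' in text:
        return text + " <-- HAS"
    else:
        return text + " <-- DOESN'T HAVE"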

hello = '<span>Привет,&nbspмир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("\nBefore prettify:\n{}".format(soup))
soup = soup.prettify(formatter=check_for_nbsp)
print("\nAfter prettify:\n{}".format(soup))

Output:

Before prettify:
<span>Привет, мир</span>

After prettify:
<span>
 Привет, мир <-- HAS
</span>
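
The \xa0 observation can be checked without BeautifulSoup. The following is a minimal sketch using only the standard library's html module (it is not part of the original example). It shows that &nbsp; decodes to U+00A0, a different character from an ordinary space, which would explain why the prettified output only looks like it contains plain whitespace:

```python
import html
from html.entities import name2codepoint

# The named entity "nbsp" maps to code point 160, i.e. '\xa0'.
assert name2codepoint['nbsp'] == 0xA0

# Decode the entity the same way an HTML parser would.
decoded = html.unescape('Привет,&nbsp;мир')

print('\xa0' in decoded)   # True
print(' ' in decoded)      # False: it is not an ordinary space
print(repr(decoded))       # repr makes the otherwise invisible character visible
```

So a formatter that wants to emit the entity again has to look for '\xa0' rather than for the literal text "&nbsp".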

Is there a way to get the best of both worlds, preserving the &nbsp AND the Cyrillic characters? Alternatively, is there a reliable Python package other than BeautifulSoup that prettifies HTML?

Here is a previous Stack Overflow question I posted about the mangled Cyrillic characters. That is what led me to remove the formatter='html' option. Unfortunately, doing so removes the &nbsp characters, which is equally problematic.

bikz · Accepted Answer · 2021-10-31T21:40:58.603

I was able to solve this problem. In these docs I discovered the EntitySubstitution class in the bs4.dammit module. It implements Beautiful Soup's standard formatters as class methods: the "html" formatter (which preserves &nbsp characters) is EntitySubstitution.substitute_html. Calling that method yourself gives you the same behavior for the parts of a string you choose, while letting you handle the rest differently.

(p.s., &nbsp are parsed in BeautifulSoup as \xa0)

Here is the code:

from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution # don't miss this import statement!

'''
This is the custom formatter.
prettify calls this function for every string and attribute value it encounters
and writes whatever the function returns into the prettified output.

Strategy:
 - Split the string on &nbsp characters.
 - Return the portions that are not &nbsp as is.
 - Run each &nbsp through EntitySubstitution.substitute_html,
   which preserves it as an entity.
'''
def preserve_nbsp_and_ru(text):
    newstr = ""
    split_str = text.split('\xa0') # &nbsp are parsed as \xa0 in BS
    # (this will split a&nbspb&nbspc --> [a,b,c])
    for i, text_between in enumerate(split_str):
        # text_between is regular text, preserve it as is
        newstr += text_between
        # add an &nbsp after it, unless you're on the last
        # item in the list, after which there would not be an &nbsp
        if i < len(split_str) - 1:
            # put the nbsp through the EntitySubstitution function,
            # which will preserve it
            newstr += EntitySubstitution.substitute_html('\xa0')
    return newstr

hello = '<span>Привет,&nbspмир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("\nBefore prettify:\n{}".format(soup))
soup = soup.prettify(formatter=preserve_nbsp_and_ru)
print("\nAfter prettify:\n{}".format(soup))

Output:

Before prettify:
<span>Привет, мир</span>

After prettify:
<span>
 Привет,&nbsp;мир
</span>
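
One caveat worth checking: when a plain function is passed as the formatter, it appears to replace BeautifulSoup's default escaping entirely, so the function above would write text containing &, < or > unescaped. The sketch below is a variation, not the code from the answer. It assumes the standard library's html.escape is an acceptable stand-in for the escaping step, and the name escape_and_keep_nbsp is made up for illustration:

```python
import html

NBSP = '\xa0'  # the character that &nbsp; is parsed into


def escape_and_keep_nbsp(text):
    """Escape &, < and >, then write non-breaking spaces as &nbsp;.

    Everything else, Cyrillic included, is returned unchanged.
    """
    # quote=False leaves quote characters alone; only &, < and > are escaped.
    escaped = html.escape(text, quote=False)
    # Escape first, replace second, so the & in "&nbsp;" is not escaped again.
    return escaped.replace(NBSP, '&nbsp;')


print(escape_and_keep_nbsp('Привет,\xa0мир'))   # Привет,&nbsp;мир
print(escape_and_keep_nbsp('Tom & Jerry <3'))   # Tom &amp; Jerry &lt;3
```

It can be passed to prettify the same way as the function above, e.g. soup.prettify(formatter=escape_and_keep_nbsp). If your documents never contain literal &, < or > in text or attribute values, the simpler function above should be enough.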
