Convert non UTF-8 ASCII literals in otherwise UTF-8 text to their respective character

Question

I have UTF-8 encoded text that has been mangled and contains some 'cp1252' ASCII literals. I am trying to isolate the literals and convert them one by one; however, the following code does not work and I can't understand why...

import re

text = "This text contains some ASCII literal codes like \x9a and \x9e."

# Find all ASCII literal codes in the text
codes = re.findall(r'\\x[0-9a-fA-F]{2}', text)

# Replace each ASCII literal code with its decoded character
for code in codes:
    char = bytes(code, 'ascii').decode('cp1252')
    text = text.replace(code, char)

print(text)

It's *really* mangled if it has a 4-character escape sequence instead of the byte the escape sequence is supposed to represent. — chepner, Feb 12 '23 at 21:31
`text` does not contain the 4-character sequence ``\``, `x`, `9`, `a`; it contains the single Unicode character U+009A. — chepner, Feb 12 '23 at 21:33
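
To make the point in the comments concrete, here is a small sketch (the shortened sample string and the variable name `raw` are assumptions for illustration). It shows that Python resolves `\x9a` inside a normal string literal to a single character, so the question's pattern, which looks for a literal backslash, finds nothing:

```python
import re

text = "like \x9a and \x9e."

# The escape is resolved when the literal is parsed, so "\x9a" is one character.
print(len("\x9a"), hex(ord("\x9a")))

# The pattern searches for a literal backslash, which this string does not contain.
print(re.findall(r'\\x[0-9a-fA-F]{2}', text))

# A raw string keeps the four characters \, x, 9, a, and there the pattern matches.
raw = r"like \x9a and \x9e."
print(re.findall(r'\\x[0-9a-fA-F]{2}', raw))
```

Only if the text really held the four-character sequences, as in `raw`, would the regex approach have anything to match.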

Mark Tolonen · Accepted Answer · 2023-02-12T21:39:23.283

No regex needed. Encoding in latin1 converts 1:1 from Unicode code points U+0000 to U+00FF to bytes b'\x00' to b'\xff'. Then decode correctly:

>>> text = "This text contains some ASCII literal codes like \x9a and \x9e."
>>> text.encode('latin1').decode('cp1252')
'This text contains some ASCII literal codes like š and ž.'

The text was probably decoded as ISO-8859-1 (another name for Latin-1) in the first place. Ideally, fix that code to decode as cp1252.
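
If the decoding step cannot be fixed at the source, note that the one-liner above assumes every character fits in Latin-1 and every byte is defined in cp1252. As a rough sketch for text where that may not hold, the hypothetical helper `fix_c1` below (not part of the answer above) touches only the range U+0080 to U+009F, where Latin-1 and cp1252 disagree, and leaves alone the bytes that Python's cp1252 codec does not define:

```python
import re

# Match only the C1 control range, where Latin-1 and cp1252 disagree.
_C1 = re.compile(r'[\x80-\x9f]')

def fix_c1(text: str) -> str:
    """Hypothetical helper: reinterpret C1 control characters as cp1252."""
    def repl(match: re.Match) -> str:
        byte = match.group().encode('latin1')
        try:
            return byte.decode('cp1252')
        except UnicodeDecodeError:
            # Undefined in Python's cp1252 codec (e.g. 0x81); keep unchanged.
            return match.group()
    return _C1.sub(repl, text)

# The euro sign is outside Latin-1, so text.encode('latin1') would fail here.
print(fix_c1("Price: 5 \u20ac, names: \x9a and \x9e."))
```

For input that is known to contain only code points up to U+00FF, the plain encode/decode round trip remains the simpler choice.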
