Converting widechars to system ANSI encoding in Python

Question

I am currently trying to make my screen reader work better with Becky! Internet Mail. The problem I am facing is related to its list view. This control is not Unicode aware, but its items are custom drawn on screen, so to someone looking at it the content of every field looks fine regardless of encoding. When accessed via MSAA or UIA, however, basic ANSI characters and mails encoded with the code page set for non-Unicode programs have their text exposed correctly, whereas mails encoded in Unicode do not. Samples of the text:

Zażółć gęślą jaźń

is represented by:

ZaĹĽĂłĹ‚Ä‡ gÄ™Ĺ›lÄ… jaĹşĹ„

In this case it is damaged CP1250, as per the answer below. However:

⚠️ is represented by: âš ď¸Ź

⏰ is represented by: âŹ°

and 高生旺 is represented by: é«ç”źć—ş

I had simply assumed that these strings were damaged beyond repair; however, when the beta Unicode support in Windows 10 is enabled, they are exposed correctly.

Is it possible to simulate this behavior in Python?

The solution needs to work in both Python 2 and 3.

At the moment I am simply replacing known combinations of these characters with their proper representations, but this is not a very good solution, because the lists of replacements and of characters to replace need to be updated with each newly discovered character.

ANSI is not a well-defined term in this context. What you have seems to be Windows code page 1250. You should probably read the [Stack Overflow `character-encoding` tag info page](/tags/character-encoding/info) for background. — tripleee, Dec 30 '19 at 12:45

Trapli · Accepted Answer · 2019-12-30T12:42:35.500

Your UTF-8 is being decoded as cp1250.

What I did in Python 3 is this:

orig = "Zażółć gęślą jaźń"
wrong = "ZaĹĽĂłĹ‚Ä‡ gÄ™Ĺ›lÄ… jaĹşĹ„"

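for enc in range(437, 1300):
    try:
        res = orig.encode("utf-8").decode(f"cp{enc}")
    except (LookupError, UnicodeError):
        # No such code page, or it cannot decode these bytes: try the next one.
        continue
    if res == wrong:
        print('FOUND', res, enc)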

...and the result was the 1250 codepage.
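That finding can be double-checked with a round trip. The snippet below is a minimal sketch that only uses the sample string from the question; the variable names are illustrative. It produces the damage by decoding UTF-8 bytes as cp1250 and then reverses both steps.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

original = u"Zażółć gęślą jaźń"

# Step 1: the real bytes are UTF-8, but they get read as cp1250.
damaged = original.encode("utf-8").decode("cp1250")

# Step 2: re-encoding with cp1250 yields the UTF-8 bytes again,
# which can then be decoded correctly.
recovered = damaged.encode("cp1250").decode("utf-8")

print(recovered == original)  # True
```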

So your solution should be:

import sys

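def restore(garbaged):
    # Python 3: str is text; undo the wrong cp1250 decoding, then decode as UTF-8
    if sys.version_info.major > 2:
        return garbaged.encode('cp1250').decode('utf-8')
    # Python 2: both branches return a UTF-8 encoded byte string
    else:
        # byte string (str) that holds the garbled text as UTF-8
        try:
            return garbaged.decode('utf-8').encode('cp1250')
        # unicode object: the implicit ASCII encode inside .decode() fails
        except UnicodeEncodeError:
            return garbaged.encode('cp1250')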
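An alternative without the version check is possible, because `bytes` is an alias of `str` on Python 2. The following is only a sketch, not the code above: `restore_text` and its `codepage` parameter are hypothetical names, and unlike `restore` it always returns text (`unicode` on Python 2, `str` on Python 3).

```python
# -*- coding: utf-8 -*-

def restore_text(garbled, codepage="cp1250"):
    """Undo UTF-8 text that was mis-decoded as `codepage`; returns text."""
    # On Python 2 `bytes` is `str`, so this branch handles byte strings
    # on both versions. Assumption: such byte strings are UTF-8 encoded.
    if isinstance(garbled, bytes):
        garbled = garbled.decode("utf-8")
    # Re-encode with the wrong code page to get the original bytes back,
    # then decode them as what they really are: UTF-8.
    return garbled.encode(codepage).decode("utf-8")


original = u"Zażółć gęślą jaźń"
damaged = original.encode("utf-8").decode("cp1250")

restored_from_text = restore_text(damaged)
restored_from_bytes = restore_text(damaged.encode("utf-8"))
```

Both calls give back the original string. Input containing characters that cp1250 cannot encode still raises `UnicodeEncodeError`.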

EDIT:

The reason why "高生旺" cannot be recovered from é«ç”źć—ş:

"高生旺".encode('utf-8') is b'\xe9\xab\x98\xe7\x94\x9f\xe6\x97\xba'.

The problem is the \x98 byte. cp1250 has no character assigned to that value. If you try this:

"高生旺".encode('utf-8').decode('cp1250')

You will get this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 2: character maps to <undefined>
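To see which byte values are affected, Python's cp1250 codec can be probed one byte at a time. This is a small sketch; the name `undefined` is illustrative, and the result reflects Python's codec table, which is not necessarily what Windows itself does with those bytes.

```python
from __future__ import print_function

# Collect every byte value that Python's cp1250 codec refuses to decode.
undefined = []
for value in range(256):
    try:
        bytes(bytearray([value])).decode("cp1250")
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(value) for value in undefined])
```

Any UTF-8 sequence that contains one of the listed bytes cannot pass through a strict cp1250 decode.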

The way to get "é«ç”źć—ş" is:

"高生旺".encode('utf-8').decode('cp1250', 'ignore')

But the 'ignore' error handler is the critical part: it silently drops the undecodable byte, which causes data loss:

'é«ç”źć—ş'.encode('cp1250') is b'\xe9\xab\xe7\x94\x9f\xe6\x97\xba'.

If you compare these two:

b'\xe9\xab\xe7\x94\x9f\xe6\x97\xba'
b'\xe9\xab\x98\xe7\x94\x9f\xe6\x97\xba'

you will see that the \x98 byte is missing, so when you try to restore the original content you will get: UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte.

If you try this:

'é«ç”źć—ş'.encode('cp1250').decode('utf-8', 'backslashreplace')

The result will be '\\xe9\\xab生旺'. The bytes \xe9\xab\x98 could be decoded to 高, but from \xe9\xab alone that is not possible.
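Recovery is only possible if the undefined byte was not dropped in the first place. The sketch below rests on an assumption the thread does not confirm: that the text handed to the plugin carries each undefined byte as the C1 control character with the same value (U+0098 for \x98) instead of losing it. `simulate_damage` and `mojibake_to_text` are hypothetical helper names.

```python
# -*- coding: utf-8 -*-

def simulate_damage(text, codepage="cp1250"):
    """Decode UTF-8 bytes as `codepage`, keeping undefined bytes as C1 controls."""
    out = []
    for value in bytearray(text.encode("utf-8")):
        one = bytes(bytearray([value]))
        try:
            out.append(one.decode(codepage))
        except UnicodeDecodeError:
            # Assumption: an undefined byte becomes the code point of the
            # same value, which is what a latin-1 decode produces.
            out.append(one.decode("latin-1"))
    return u"".join(out)


def mojibake_to_text(garbled, codepage="cp1250"):
    """Reverse simulate_damage: rebuild the raw bytes, then decode as UTF-8."""
    raw = bytearray()
    for ch in garbled:
        if 0x80 <= ord(ch) <= 0x9F:
            # C1 control character: treat it as the raw byte it stands for.
            raw.append(ord(ch))
        else:
            raw.extend(ch.encode(codepage))
    return bytes(raw).decode("utf-8")


chinese = u"高生旺"
recovered = mojibake_to_text(simulate_damage(chinese))
```

If the string that arrives has already lost the \x98 position, as in the 'ignore' example above, this helper fails with the same UnicodeDecodeError.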

It works when done in the interpreter; however, when done in a plugin in the screen reader, the original string is received as Unicode and it fails as follows: File "encodings\utf_8.pyo", line 16, in decode UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-9: ordinal not in range(128) — lukaszgo1, Oct 18 '19 at 08:24
@lukaszgo1 updated the code so it works now for unicode chars in python2 too — Trapli, Oct 18 '19 at 08:53
While it works for the string from my original question it fails for others. I've updated my question accordingly. — lukaszgo1, Dec 28 '19 at 18:53
`print(restore(u"test âš\xa0 âŹ°"))` results in `test ⚠ ⏰` for me in both Python 2 and 3 — Trapli, Dec 29 '19 at 20:37
Okay, it definitely works. I was trying in CMD, hence the failures. Any chance of a solution for the Chinese string from my question? I'll wait a few days for one, and if there is none I will accept your answer. — lukaszgo1, Dec 29 '19 at 23:00
