Replacing byte in bytes array to fix encoding

Question

I'm using ftfy to fix broken UTF-8 encoding that shows as CP1252 and convert it to UTF-8 cyrillic, but I've found that some letters can't be fixed.

I have a string Ð'010Ð¡Ð¡199 that I convert to bytes and define pairs b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199" where:

\xc3\x90' -> \xd0\x92 -> Cyrillic В
\xc3\x90\xc2\xa1\ -> \xd0\xa1 -> cyrillic С

As you can see Ð' length is 2. ord won't work in this case.

For using slice I must know where is start and end.

Translate also doesn't work here.

Previously I've used simple string replacement, but now I'd like to improve my method and exclude mistakes.

Original Ð'010Ð¡Ð¡199 -> conversion -> outputВ010СС199

EDIT:

    str = "Ð'010Ð¡Ð¡199"
    str_to_bytes = str.encode("UTF-8")
    print(str_to_bytes)
    # UTF-8 bytes
    # \xc3\x90\xc2\xa0 : \xd0\xa0 -> cyrillic Р
    # \xc3\x90\xc2\xa1 : \xd0\xa1 -> cyrillic С
    # \xc3\x90\xe2\x80\x94' : \xd0\x97 -> cyrillic З
    # \xc3\x90' : \xd0\x92 -> Cyrillic В
    test_str = b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199"
    t1 = test_str.replace(b'\xc3\x90\xc2\xa1', b'\xd0\xa1')
    print(t1)
    dict_cyr = {"Ð'": "P",
                "Ð¡":"C"}
    t2 = test_str.translate(test_str)
    print(t2)

I can explain how I received results. 1. I used 2cyr.com decoder. But even it failed in some cases. 2. I have a manually translated strings, so I compared them and define what byte corresponds to cyrillic letter with help of UTF-8 chartable.

I can, but it's useless anyway. I just convert string and manually define bytes pairs. — Rostislav Aleev, Feb 11 '19 at 10:51
What I'm thinking about is to use `list[str_to_bytes]` and use decimal values. Because `\xc3\x90` looks like a control character. — Rostislav Aleev, Feb 11 '19 at 11:05
What you have is a UTF-8 - CP1252 Mojibake, and recovering the missing bytes is not going to be straightforward. UTF-8 pairs follow a specific pattern, but not all UTF-8 bytes have CP1252 equivalents. When those are missing, you have to *guess* what can replace them. — Martijn Pieters, Feb 11 '19 at 11:15
You said you were working with ftfy, how are you using that? Do you have the original binary data? You show a `str` object and a `test_str` value. — Martijn Pieters, Feb 11 '19 at 11:16
And you specifically used the term 'bytes array' in your title, so I was assuming you already were using the [`bytearray` type](https://docs.python.org/3/library/stdtypes.html#bytearray). — Martijn Pieters, Feb 11 '19 at 11:20
@MartijnPieters, I'm sorry for misleading. I found a way to use `bytes.hex()`. Now I have a string and can use `re` again. I know that I have to guess, but their representation in bytes is always the same. Original data is in SQL DB, I work with exported data to excel spreadsheet (sic!), fix encoding and may be in future it must be imported to SQL again. Problem that as you say original source is an application that uses CP1252 instead of UTF8. — Rostislav Aleev, Feb 11 '19 at 11:40
Now it's really simple. Original hex `543335374bc390c5be373737` output hex `543335374bd09e373737` and I use `re.findall()` + `str.replace()` — Rostislav Aleev, Feb 11 '19 at 12:47
I'm trying to determine where this [Mojibake](https://en.wikipedia.org/wiki/Mojibake) originated. It could be that you connected to the SQL database with an incorrect codec setting. It could be that the fault is with the application inserting the data. The missing bytes could well still be there. — Martijn Pieters, Feb 11 '19 at 13:37
`bytes.hex()` is not the most productive way to solve this, but if it works for you, then so be it. You can use regular expressions and `bytes.replace()` with `bytes` objects just fine, no need to transform the contained data into strings first. — Martijn Pieters, Feb 11 '19 at 13:38

Serge Ballesta · Accepted Answer · 2019-02-11T14:14:21.923

A common problem in encoding/decoding is encoding a string in utf-8 and later decoding the bytestring as if it were cp1252 (often because of a stupid windows app).

It could be what happens here, because CYRILLIC CAPITAL LETTER VE ('В' or '\u0412') and CYRILLIC CAPITAL LETTER ES (or) respectively translate as:

>>> '\u0412'.encode().decode('cp1252')
'Ð’'
>>> '\u0421'.encode().decode('cp1252')
'Ð¡'

Which is close from your original string, except that my transformation uses a RIGHT SINGLE QUOTATION MARK (’ or U+2019) while your string contains an APOSTROPHE (' or U+0027).

If the string actually contains an APOSTROPHE, it could be caused by an attempt of filtering non latin characters from a cp1252 encoded string. The downside is that it is hard to guess whether the apostrophe is a true one or a filtered right single quotation mark.

If it does contain a single quotation mark, then it can be transformed back as simply as:

>>> 'Ð’010Ð¡Ð¡199'.encode('cp1252').decode()
'В010СС199'

I just have no words. Accepted. – Rostislav Aleev Feb 11 '19 at 13:53 — Rostislav Aleev, Feb 11 '19 at 13:53

Replacing byte in bytes array to fix encoding

1 Answers1