I'm using ftfy to fix broken UTF-8 text that shows up as CP1252 and convert it back to UTF-8 Cyrillic, but I've found that some letters can't be fixed.
I have the string Ð'010СС199, which I convert to bytes:
b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199"
and then map byte sequences to letters, where:
\xc3\x90' -> \xd0\x92 -> Cyrillic В
\xc3\x90\xc2\xa1 -> \xd0\xa1 -> Cyrillic С
As you can see, the length of Ð' is 2, so ord won't work in this case. To use a slice, I would need to know where each sequence starts and ends. translate doesn't work here either.
Previously I used simple string replacement, but now I'd like to improve my method and rule out mistakes.
Original Ð'010СС199 -> conversion -> output В010СС199
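A minimal sketch of the conversion above, using only the standard library rather than ftfy. It assumes that the straight apostrophe after Ð was originally a right single quotation mark (U+2019, which is byte 0x92 in CP1252) and that something upstream flattened it; that is a guess, not something confirmed here. The name `repair_mojibake` is made up for this example, and the input is built from the byte string shown earlier.

```python
import re


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mis-decoded as CP1252 (hypothetical helper)."""
    # Assumption: an ASCII apostrophe directly after a lead character
    # (U+00D0 or U+00D1) used to be U+2019 before it was straightened.
    restored = re.sub("(?<=[\u00d0\u00d1])'", "\u2019", text)
    # Re-encode to the original bytes, then decode them as UTF-8.
    return restored.encode("cp1252").decode("utf-8")


# The broken string, built from the bytes shown earlier.
broken = b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199".decode("utf-8")
print(repair_mojibake(broken))
```

Strict cp1252 encoding raises on characters it can't represent, so letters whose second UTF-8 byte is undefined in Python's cp1252 codec (0x81, 0x8D, 0x8F, 0x90, 0x9D) would need extra handling.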
EDIT:
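text = "Ð'010СС199"  # renamed from str to avoid shadowing the built-in
text_bytes = text.encode("UTF-8")
print(text_bytes)
# UTF-8 bytes of the broken text : UTF-8 bytes of the intended letter
# \xc3\x90\xc2\xa0 : \xd0\xa0 -> Cyrillic Р
# \xc3\x90\xc2\xa1 : \xd0\xa1 -> Cyrillic С
# \xc3\x90\xe2\x80\x94 : \xd0\x97 -> Cyrillic З
# \xc3\x90' : \xd0\x92 -> Cyrillic В
test_str = b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199"
# Replacing one known byte sequence at a time works
t1 = test_str.replace(b'\xc3\x90\xc2\xa1', b'\xd0\xa1')
print(t1)
dict_cyr = {"Ð'": "P",
            "С": "C"}
# bytes.translate() expects a 256-byte table that maps single bytes,
# so it cannot handle multi-byte sequences and this call fails
try:
    t2 = test_str.translate(test_str)
    print(t2)
except ValueError as exc:
    print(exc)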
I can explain how I got these results. 1. I used the 2cyr.com decoder, but even it failed in some cases. 2. I have manually translated strings, so I compared them and worked out which bytes correspond to which Cyrillic letter with the help of a UTF-8 chart.
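The manual comparison in step 2 can probably be automated. A rough sketch, assuming the damage comes from UTF-8 bytes being decoded as CP1252: encode each Cyrillic letter as UTF-8, decode those bytes as CP1252, and record the result. The helper names are hypothetical, and the Latin-1 fallback for the bytes that Python's cp1252 codec leaves undefined is an assumption about how such bytes were handled.

```python
def decode_cp1252_loosely(data: bytes) -> str:
    """Decode as CP1252, falling back to Latin-1 for undefined bytes."""
    chars = []
    for byte in data:
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: undefined bytes map to the same code point.
            chars.append(chr(byte))
    return "".join(chars)


def mojibake_for(letter: str) -> str:
    """Return what a letter looks like after the UTF-8 -> CP1252 mix-up."""
    return decode_cp1252_loosely(letter.encode("utf-8"))


# Broken text -> intended letter, for U+0410 through U+044F.
table = {mojibake_for(chr(cp)): chr(cp) for cp in range(0x0410, 0x0450)}

for letter in "РСЗВ":
    broken = mojibake_for(letter)
    print(repr(broken), broken.encode("utf-8"), letter)
```

For В this gives Ð followed by U+2019 (bytes \xc3\x90\xe2\x80\x99), not a straight apostrophe, which may be why that letter resists automatic fixing.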