Fixing file encoding

Question

Today I ordered a translation for 7 different languages, and 4 of them appear to be great, but when I opened the other 3, namely Greek, Russian, and Korean, the text that was there wasn't related to any language at all. It looked like a bunch of error characters, like the kind you get when you have the wrong encoding on a file.

For instance, here is part of the output of the Korean translation:

½Ì±ÛÇÃ·¹ÀÌ¾î

¸ÖÆ¼ÇÃ·¹ÀÌ¾î

¿É¼Ç

I may not even speak a hint of Korean, but I can tell you with all certainty that is not Korean.

I assume this is a file encoding issue, and when I open the file in Notepad, the encoding is listed as ANSI, which is clearly a problem; the same can be said for the other two languages.

Does anyone have any ideas on how to fix the encoding of these 3 files; I requested the translators reupload in UTF-8, but in the meantime, I thought I might try to fix it myself.

If anyone is interested in seeing the actual files, you can get them from my Dropbox.

D.Shawley · Accepted Answer · 2015-07-02T21:21:02.287

If you look at the byte stream as pairs of bytes, they look vaguely Korean but I cannot tell of they are what you would expect or not.

bash$ python3.4
Python 3.4.3 (v3.4.3:b4cbecbc0781, May 30 2015, 15:45:01)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> buf = '½Ì±ÛÇÃ·¹ÀÌ¾î'
>>> [hex(ord(b)) for b in buf]
>>> ['0xbd', '0xcc', '0xb1', '0xdb', '0xc7', '0xc3', '0xb7', '0xb9', '0xc0', '0xcc', '0xbe', '0xee']
>>> u'\uBDCC\uB1DB\uC7C3\uB7B9\uC0CC\uBEEE'
'뷌뇛쟃랹샌뻮'

Your best bet is to wait for the translator to upload UTF-8 versions or have them tell you the encoding of the file. I wouldn't make the assumption that they bytes are simply 16 bit characters.

Update

I passed this through the chardet module and it detected the character set as EUC-KR.

>>> import chardet
>>> chardet.detect(b'\xBD\xCC\xB1\xDB\xC7\xC3\xB7\xB9\xC0\xCC\xBE\xEE')
{'confidence': 0.833333333333334, 'encoding': 'EUC-KR'}
>>> b'\xBD\xCC\xB1\xDB\xC7\xC3\xB7\xB9\xC0\xCC\xBE\xEE'.decode('EUC-KR')
'싱글플레이어'

According to Google translate, the first line is "Single Player". Try opening it with Notepad and using EUC-KR as the encoding.

Fixing file encoding

1 Answers1