I tried to read a Korean text file encoded in 'euc-kr' in Python, but it raised errors. After inspecting the encodings
module for a while, I found that it seems to encode Korean characters in a very strange way. Let me give an example.
The Korean character 탇 (a rarely used character, but I need it for a pronunciation dictionary) is supposed to be encoded as B5 6E according to the EUC-KR spec (I referred to this site). But the encodings module gives me a different result.
# python3
>>> from encodings import euc_kr
>>> euc_kr.codec.decode(b'\xB5\x6E')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'euc_kr' codec can't decode byte 0xb5 in position 0: illegal multibyte sequence
>>> euc_kr.codec.encode('탙')
(b'\xa4\xd4\xa4\xbc\xa4\xbf\xa4\xbc', 1)
As you can see, I get an error when I try to decode B5 6E, and euc_kr.codec.encode
gives me more bytes than I expected. I have no clue what's happening there. How can I avoid the error when decoding B5 6E (and many other Korean characters)? Is there another document on the EUC-KR specification that I can read to understand how the Python implementation of EUC-KR works?
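One check that might narrow this down is sketched below. It assumes (and this is only a guess) that the bytes actually come from CP949, the Windows superset of EUC-KR, rather than from strict EUC-KR. The helper name try_decode is made up for illustration; 'euc_kr' and 'cp949' are the standard Python codec names.

```python
# Sketch: compare Python's strict 'euc_kr' codec with the 'cp949' superset.
raw = b"\xb5\x6e"  # the two bytes from the example above


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


euc_kr_text = try_decode(raw, "euc_kr")  # rejected, as in the traceback above
cp949_text = try_decode(raw, "cp949")    # assumption: the data is really CP949

print("euc_kr:", euc_kr_text)
print("cp949 :", cp949_text)

# The 8-byte euc_kr output looks like a filler code (A4 D4) followed by
# three two-byte jamo codes, not a single two-byte syllable code.
composed = "탙".encode("euc_kr")
print(composed.hex())
```

If the cp949 line prints a single Hangul syllable, the file is probably CP949 and could be opened with open(path, encoding='cp949') instead of 'euc-kr'.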