Python encodes (Korean) characters in an unexpected way with euc-kr encoding (codecs, encodings module)

Question

I tried to read a Korean text file encoded in 'euc-kr' in Python, but it raised errors. After inspecting the encodings module for a while, I found that it seems to encode Korean characters in a very strange way. Let me give an example.

The Korean character 탇 (a rarely used character, but I need it for a pronunciation dictionary) is supposed to be encoded as B5 6E according to the EUC-KR spec (I referred to this site). But the encodings module gives me a somewhat different result.

# python3
>>> from encodings import euc_kr
>>> euc_kr.codec.decode(b'\xB5\x6E')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'euc_kr' codec can't decode byte 0xb5 in position 0: illegal multibyte sequence
>>> euc_kr.codec.encode('탙')
(b'\xa4\xd4\xa4\xbc\xa4\xbf\xa4\xbc', 1)

As you can see, I get an error when I try to decode B5 6E, and euc_kr.codec.encode gives me more bytes than I expected. I have no clue what's happening here. How can I avoid the error when decoding B5 6E (and many other Korean characters)? Is there another document on the EUC-KR specification that I can read to understand how Python's implementation of EUC-KR works?

Yes, that is odd. I'm not familiar with the euc encodings, but I don't understand why `euc_kr.codec.encode('탙')` results in so many bytes when euc_kr is supposed to encode each codepoint in 1 or 2 bytes. BTW, you don't need to use `euc_kr.codec.encode(s)`, you can just do `s.encode('euc_kr')`. — PM 2Ring, Oct 16 '17 at 14:06

score 4 · Accepted Answer · answered Oct 16 '17 at 15:55

It looks like the euc_kr result is some kind of decomposition. You might try cp949, which according to Wikipedia:

The default Korean codepage for Windows (code page 949) is a proprietary, but upward compatible extension of EUC-KR...
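
As a quick check of that compatibility claim, here is a minimal sketch using only the standard codecs. It assumes the bytes B5 6E from the question; the name euc_kr_ok is just an illustrative variable.

```python
raw = b"\xb5\x6e"  # the two bytes from the question

# Strict EUC-KR rejects this pair: 0x6E is outside its trail-byte range (0xA1-0xFE).
try:
    raw.decode("euc_kr")
    euc_kr_ok = True
except UnicodeDecodeError:
    euc_kr_ok = False

# cp949 accepts the same bytes and yields a single syllable.
text = raw.decode("cp949")
print(euc_kr_ok, text)
```

If your file behaves the same way, opening it with open(path, encoding="cp949") instead of 'euc-kr' is probably the simplest fix, assuming the file was actually written with the Windows code page.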

Some experimentation:

>>> import unicodedata as ud
>>> s = '탇'
>>> ud.name(s)
'HANGUL SYLLABLE TAD'
>>> s.encode('euc_kr')
b'\xa4\xd4\xa4\xbc\xa4\xbf\xa4\xa7'
>>> s.encode('euc_kr').decode('cp949')
'ㅤㅌㅏㄷ'
>>> for c in s.encode('euc_kr').decode('cp949'):
...     print(ud.name(c))
...     
HANGUL FILLER
HANGUL LETTER THIEUTH
HANGUL LETTER A
HANGUL LETTER TIKEUT
>>> s.encode('cp949').hex()
'b56e'
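
To make the structure of the 8-byte result easier to see, here is a small sketch that splits it into 2-byte units and round-trips it. As far as I can tell this is the form the CPython source calls the KS X 1001 Annex 3 "make-up sequence", but treat that label as my reading rather than something the experiment above proves. The variable names are only for illustration.

```python
s = "탇"

long_form = s.encode("euc_kr")   # 8 bytes: filler + three jamo, 2 bytes each
short_form = s.encode("cp949")   # 2 bytes: the code the question expected

# Split the 8-byte form into its four 2-byte units.
units = [long_form[i:i + 2].hex() for i in range(0, len(long_form), 2)]
print(units)

# euc_kr reassembles the syllable; cp949 keeps the four units as separate characters.
round_trip = long_form.decode("euc_kr")
as_cp949 = long_form.decode("cp949")
print(round_trip == s, len(as_cp949), short_form.hex())
```

So text written by Python's euc_kr codec should still be readable by that same codec, but other software that expects the 2-byte cp949 code may show the filler and jamo separately.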

Mark Tolonen

Thank you so much. Everything is clear to me now. For your information, each Korean character consists of a chosung, a joongsung, and optionally a jongsung, but the grammar doesn't allow every possible combination of the three. For example, of '탓' and '탇', Korean grammar only allows '탓' as a valid Korean character. But there are some cases where irregular Korean characters are allowed, such as in pronunciation symbols. In fact, in the pronunciation symbol system, '탓' is not valid, but '탇' is. – Oct 16 '17 at 17:26
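
For anyone who wants to see the chosung / joongsung / jongsung split in code, here is a rough sketch based on the standard Unicode arithmetic for precomposed Hangul syllables (19 initials, 21 medials, 28 final slots including "no final"). The helper name split_syllable is made up for this example.

```python
import unicodedata

HANGUL_BASE = 0xAC00      # first precomposed syllable, U+AC00
SYLLABLE_COUNT = 11172    # 19 initials * 21 medials * 28 final slots


def split_syllable(ch):
    """Return (chosung, joongsung, jongsung) indices; jongsung 0 means no final."""
    index = ord(ch) - HANGUL_BASE
    if not 0 <= index < SYLLABLE_COUNT:
        raise ValueError("not a precomposed Hangul syllable: %r" % ch)
    cho, rest = divmod(index, 21 * 28)
    jung, jong = divmod(rest, 28)
    return cho, jung, jong


for ch in "탓탇":
    print(ch, unicodedata.name(ch), split_syllable(ch))
```

Both syllables share the same chosung and joongsung indices and differ only in the jongsung, which matches the '탓' versus '탇' example above.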
