Python: gb2312 codec can't decode bytes

Question

I have an encoded-word string from a received mail. When parsing the encoded word in Python 3, I get an exception

'gb2312' codec can't decode bytes in position 18-19: illegal multibyte sequence

raised from the make_header function.

from email.header import decode_header, make_header

hdr = decode_header("""=?gb2312?B?QSBWIM34IMXMILP2IMrbICAgqEMgs8kgyMsg?=""")
make_header(hdr)

Parsing the encoded string with online tools works without problems (http://dogmamix.com/MimeHeadersDecoder/). Any suggestions as to what I am doing wrong? Thanks

score 2 · Accepted Answer · answered Sep 27 '16 at 07:47

The error message tells you that the bytes in position 18-19 are not valid for this encoding.

decode_header simply extracts a bunch of bytes and an encoding. make_header actually attempts to interpret those bytes in that encoding, and fails, because these bytes are not valid in that encoding.
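To see the two steps separately, here is a minimal sketch. It approximates what make_header does by calling bytes.decode directly on the payload, which is an assumption about the relevant step rather than a copy of the library's internals. The names raw, payload, charset and failure are only illustrative.

```python
from email.header import decode_header

raw = "=?gb2312?B?QSBWIM34IMXMILP2IMrbICAgqEMgs8kgyMsg?="

# Step 1: decode_header only undoes the Base64 layer.
# It returns (bytes, charset) pairs and does not validate the bytes.
[(payload, charset)] = decode_header(raw)
print(charset)
print(payload)

# Step 2: interpreting the bytes in the declared charset.
# This approximates what make_header attempts, and it is where the error occurs.
failure = None
try:
    payload.decode(charset)
except UnicodeDecodeError as exc:
    failure = exc
    # exc.start is the offset of the first byte the codec rejected
    print(exc.start, payload[exc.start:exc.start + 2])
```

The offset printed at the end should match the start of the range quoted in the error message, which lets you inspect the offending bytes yourself.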

Similarly,

bash$ base64 -D <<<'QSBWIM34IMXMILP2IMrbICAgqEMgs8kgyMsg' |
> iconv -f gb2312 -t utf-8
A V 网 盘 出 售   
iconv: (stdin):1:18: cannot convert

So the error message simply tells you that this data is not valid. Without more information, we cannot tell what the data should be, and neither can Python or your program.

For a rough parable, you can g??ss which b?t?s are m?ss?ng here, but not in ?h?? l?ng?? s???e???.
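In the same spirit as the parable, you can ask Python to mark the hole instead of raising. The sketch below decodes the payload with errors="replace", so undecodable bytes show up as U+FFFD. It then tries one guess, that the sender really produced GBK and mislabelled it as gb2312. That guess is purely an assumption and nothing in the message confirms it. A clean decode does not prove the sender meant it. The names payload, lossy and guess are illustrative.

```python
import base64

# The Base64 payload from the encoded word in the question
payload = base64.b64decode("QSBWIM34IMXMILP2IMrbICAgqEMgs8kgyMsg")

# Lenient decode: invalid bytes become U+FFFD instead of raising
lossy = payload.decode("gb2312", errors="replace")
print(lossy)

# Assumption, not a fact about this mail: the data may be GBK labelled as gb2312
try:
    guess = payload.decode("gbk")
except UnicodeDecodeError:
    guess = None
print(guess)
```

If you choose the lenient route in production, you lose information silently, so it may be worth logging the original bytes alongside the decoded text.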

Maybe the encoded word is really rubbish. I got confused by that online tool which (maybe) displayed that string correctly. Also I got the same result from Outlook. — Patrik Polakovic, Sep 27 '16 at 09:06
Looks like the tool you linked to faithfully decodes it to an undisplayable character. It would be nice if it displayed an error or an "unknown character" glyph but it simply implements "garbage in, garbage out". — tripleee, Sep 27 '16 at 09:13
