Python 3: my unicode2shift-jis script works except writes ASCII file. Why?

Question

I have a file containing Japanese text in Unicode, and I want to convert it to Shift-JIS and write it out to a Shift-JIS-encoded file. This is what I do:

with open("unikanji.txt", 'rb') as unikanjif:
    unikanji = unikanjif.read()

sjskanji = unikanji.decode().encode('shift-jis')

with open("kanji.txt", 'wb') as sjskanjif:
    sjskanjif.write(sjskanji)

It works, except that when I open kanji.txt it always opens as an ANSI file, not Shift-JIS, and I see garbage characters instead of Japanese. If I manually change the file encoding to Shift-JIS, the garbage characters turn into the correct Japanese characters. How do I make my program create the file as Shift-JIS to begin with?

This is an issue with how you're opening the output file to read it, not with your code, which writes out shift-jis encoded text just fine. Whatever text editor you're using doesn't detect the encoding correctly, but that's not a problem with the file or with the code that creates it. — Blckknght, Nov 01 '16 at 20:53

score 0 · Answer 1 · answered Nov 02 '16 at 17:35

"ANSI" is Microsoft's term for the default, localized encoding, which varies according to the localized version of Windows used. A Microsoft program like Notepad assumes "ANSI" for the encoding of a text file unless the file starts with a byte order mark (BOM). Notepad recognizes UTF-8, UTF-16LE and UTF-16BE BOMs.

Shift-JIS is one of those localized encodings and has no byte order mark, so Notepad cannot tell it apart from your system's ANSI default. You have to use an editor such as Notepad++ and manually set the encoding to Shift-JIS, as you have discovered. The file as you have written it is correctly Shift-JIS-encoded; unless the editor has some heuristic for detecting the encoding, it has to be told. Alternatively, on Japanese Windows, or after changing the system locale of your current Windows installation to Japanese, the ANSI default should be Microsoft's variant of Shift-JIS (code page 932) and Notepad should open the file correctly.
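
To see why the editor guesses wrong, here is a minimal sketch. The sample string and the choice of cp1252 as the "ANSI" code page are my assumptions (cp1252 is the default on Western European and US Windows; yours may differ). It shows that Shift-JIS output carries no BOM, and that the same bytes decode to different characters under a different code page.

```python
import codecs

text = "漢字"  # sample text, chosen only for illustration
sjis_bytes = text.encode("shift-jis")

# None of the BOMs that Notepad recognizes appear at the start of the data.
boms = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
has_bom = sjis_bytes.startswith(boms)

# Decoding with the right codec recovers the text ...
round_trip = sjis_bytes.decode("shift-jis")
# ... while a Western "ANSI" code page (cp1252, assumed here) yields
# unrelated characters from the very same bytes.
misread = sjis_bytes.decode("cp1252", errors="replace")

print(has_bom)
print(round_trip == text)
print(ascii(misread))
```

The bytes on disk are identical in both cases; only the interpretation changes, which is why switching the editor's encoding setting fixes the display without touching the file.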

By the way, converting between encodings can be a little more straightforward if you let open() do the decoding and encoding. The code below assumes the original file is UTF-8 and the target file should be Shift-JIS. The utf-8-sig codec automatically detects and removes a byte order mark, if present.

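# Read the source as UTF-8; utf-8-sig drops a leading BOM if there is one.
with open('unikanji.txt', encoding='utf-8-sig') as f:
    text = f.read()

# Write the same text back out, encoded as Shift-JIS.
with open('kanji.txt', 'w', encoding='shift-jis') as f:
    f.write(text)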
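
If you want to confirm from Python, rather than from an editor, that the output really is Shift-JIS, a round-trip check along these lines should do it. This is only a sketch: convert_to_shift_jis is a hypothetical helper name, and the temporary directory and sample text are there just to make the example self-contained.

```python
import tempfile
from pathlib import Path


def convert_to_shift_jis(src, dst):
    """Re-encode a UTF-8 text file (with or without a BOM) as Shift-JIS."""
    # newline="" passes line endings through unchanged in both directions.
    with open(src, encoding="utf-8-sig", newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="shift-jis", newline="") as f:
        f.write(text)
    return text


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "unikanji.txt"
    dst = Path(tmp) / "kanji.txt"
    # Create a sample UTF-8 input file that starts with a BOM.
    src.write_text("漢字\n", encoding="utf-8-sig")
    original = convert_to_shift_jis(src, dst)
    written = dst.read_bytes()

# The output decodes back to the same text when read as Shift-JIS.
verified = written.decode("shift-jis") == original
print(verified)
```

Note that if the source contains characters that Shift-JIS cannot represent, the write raises UnicodeEncodeError; passing errors="replace" to the second open() is one way to substitute a placeholder instead, at the cost of losing those characters.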
