¥ character transforms to \ after encoding/decoding in Shift-JIS

Question

How is this possible? Is it a bug? (Same behavior in Python 2.7.12 and Python 3.5.1.)

In [1]: yen = u'\u00A5'

In [2]: print(yen)
¥

In [3]: yen_after_encoding_decoding = yen.encode('shift-jis').decode('shift-jis')

In [4]: print(yen_after_encoding_decoding)
\

In [5]: yen
Out[5]: '¥'

In [6]: yen_after_encoding_decoding
Out[6]: '\\'

In [7]:

The shift-jis encoding for yen is the same as the ASCII encoding for backslash, so presumably that's related. But still weird!

I remember there was exactly this code page confusion somewhere two decades ago, and Japanese Windows users use the ¥ sign instead of the backslash in paths because of it. Started as a bug, became a feature, too late to change now. Not sure if that's still the case in Windows 10…? — deceze, Oct 13 '16 at 14:43

score 0 · Answer 1 · answered Jul 14 '20 at 16:30

The character sets used by Shift_JIS are defined by JIS (Japanese Industrial Standards). The Shift_JIS encoding uses JIS X 0201 for half-width characters and JIS X 0208 for full-width characters.

The "backslash" in the question means the half-width backslash of ASCII and ISO/IEC 8859-1 (Latin-1), which is encoded as 0x5C. JIS X 0201 (the half-width character set), on the other hand, does not contain a backslash (see https://en.wikipedia.org/wiki/JIS_X_0201). It assigns 0x5C to the yen sign instead. Many Japanese applications (e.g. Windows Explorer in a Japanese locale) show the yen sign where a backslash would otherwise appear, as in C:¥Windows.
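A small sketch of that collision, assuming Python 3 and its built-in shift_jis codec (the variable names are only illustrative): two different Unicode characters end up as the same single byte.

```python
# Sketch: the yen sign and the backslash share one byte in Python's shift_jis codec.
yen = "\u00a5"        # YEN SIGN
backslash = "\u005c"  # REVERSE SOLIDUS (backslash)

yen_bytes = yen.encode("shift_jis")
backslash_bytes = backslash.encode("shift_jis")

# Both byte strings are one byte long and identical.
print(yen_bytes.hex(), backslash_bytes.hex())
print(yen_bytes == backslash_bytes)
```

Because the two characters map to one byte, the decoder has to pick one of them, and the original character cannot be recovered from the bytes alone.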

In this situation, the behavior of the code looks like this:

yen is U+00A5 in Unicode.
yen.encode('shift-jis') produces the single byte 0x5C, because Shift-JIS contains ¥ and encodes it as 0x5C.
.decode('shift-jis') converts the byte 0x5C to U+005C (half-width backslash) in Unicode, because the codec treats ¥ as an equivalent of the backslash at that position.
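The steps above can be traced one at a time. This is a minimal sketch assuming Python 3; trace_round_trip is a hypothetical helper written for this illustration, not part of any library.

```python
def trace_round_trip(text, codec="shift_jis"):
    """Encode text with the codec, decode it again, and return both results."""
    encoded = text.encode(codec)     # step 2: Unicode -> bytes
    decoded = encoded.decode(codec)  # step 3: bytes -> Unicode
    return encoded, decoded


yen = "\u00a5"  # step 1: the yen sign is U+00A5
encoded, decoded = trace_round_trip(yen)

print("input   : U+%04X" % ord(yen))
print("encoded : 0x%s" % encoded.hex().upper())
print("decoded : U+%04X" % ord(decoded))
print("lossless:", decoded == yen)
```

Comparing the decoded text with the input, as in the last line, is one simple way to notice that a round trip through this codec changed a character.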
