How is this possible? Is it a bug? (Same behavior in Python 2.7.12 and Python 3.5.1.)

In [1]: yen = u'\u00A5'

In [2]: print(yen)
¥

In [3]: yen_after_encoding_decoding = yen.encode('shift-jis').decode('shift-jis')

In [4]: print(yen_after_encoding_decoding)
\

In [5]: yen
Out[5]: '¥'

In [6]: yen_after_encoding_decoding
Out[6]: '\\'

The shift-jis encoding for yen is the same as the ASCII encoding for backslash, so presumably that's related. But still weird!

deceze
DavidC
  • I remember there was exactly this code page confusion somewhere two decades ago, and Japanese Windows users use the ¥ sign instead of the backslash in paths because of it. Started as a bug, became a feature, too late to change now. Not sure if that's still the case in Windows 10…? – deceze Oct 13 '16 at 14:43
  • See also https://stackoverflow.com/q/33726867/5320906 – snakecharmerb Oct 15 '17 at 12:40

1 Answer

The character set of Shift_JIS is defined by JIS (Japanese Industrial Standards). The Shift_JIS encoding uses JIS X 0201 for its half-width (single-byte) characters and JIS X 0208 for its full-width (double-byte) characters.

The "backslash" in the question means the half-width backslash of ASCII and ISO/IEC 8859-1 (Latin-1), which is represented as the byte 0x5C. JIS X 0201 (the half-width character set), on the other hand, doesn't contain a backslash (see https://en.wikipedia.org/wiki/JIS_X_0201). It assigns the yen sign to 0x5C instead. Many Japanese applications (e.g. Windows Explorer in a Japanese locale) use the yen sign as the equivalent of the backslash, as in C:¥Windows.

In this situation, the behavior of the code looks like this:

  • yen is U+00A5 in Unicode.
  • yen.encode('shift-jis') is the single byte 0x5C, because ¥ is part of Shift_JIS and sits at 0x5C in that encoding.
  • .decode('shift-jis') converts the byte 0x5C to U+005C (the half-width backslash) in Unicode, because ¥ at that position is treated as the equivalent of the backslash.
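
The three steps above can be reproduced one at a time. This is only a minimal sketch, assuming Python 3 and the built-in shift_jis codec used in the question; the variable names encoded and decoded are mine, not from the question.

```python
# Step-by-step version of the round trip from the question (Python 3).
yen = '\u00a5'                         # U+00A5 YEN SIGN

encoded = yen.encode('shift_jis')      # the yen sign becomes a single byte
print(encoded)                         # b'\\', i.e. the byte 0x5C

decoded = encoded.decode('shift_jis')  # 0x5C comes back as U+005C
print(hex(ord(decoded)))               # 0x5c, the backslash

# The backslash itself encodes to the same byte, so two different
# code points collapse into one and the round trip cannot be lossless.
print('\\'.encode('shift_jis') == encoded)  # True
```

So the change happens on the way back: encoding maps both ¥ and \ to 0x5C, and decoding has to pick one of them.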
SATO Yusuke