
Running the following code seems to generate the wrong values:

byte[] data = "\u00a5".getBytes("Shift_JIS");

It produces [ -4, -4 ] (that is, 0xFC 0xFC), but I expect [ 0x5C ].

I've tried several alternative names ("Shift-JIS", "shift_jis", "cp932"), and all produce the same result.

When I feed the resulting data into the Shift-JIS decoder, I get an exception: java.nio.charset.UnmappableCharacterException: Length: 2

That is, with the decoder configured as follows:

Charset charset = Charset.forName("Shift_JIS");
CharsetDecoder decoder = charset.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);

But given the output of the encoder looks wrong, my guess is that the decoder is irrelevant. My point is that regardless of the actual bytes, the encoder generates data that it can't decode.
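One way to make this kind of silent substitution visible is to encode through a CharsetEncoder configured with CodingErrorAction.REPORT, so that an unmappable character raises an exception instead of being swapped for the encoder's replacement bytes. The following is a minimal sketch; encodeStrict is a hypothetical helper name, and which characters a given runtime treats as unmappable in Shift_JIS is not assumed here.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

final class StrictEncode {
    // Hypothetical helper: encode, but throw instead of substituting replacement bytes.
    static byte[] encodeStrict(String text, Charset charset) throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer out = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        try {
            byte[] bytes = encodeStrict("\u00a5", Charset.forName("Shift_JIS"));
            for (byte b : bytes) {
                System.out.printf("0x%02X%n", b & 0xFF); // print each byte as unsigned hex
            }
        } catch (CharacterCodingException e) {
            // Reached only if this runtime reports no Shift_JIS mapping for U+00A5.
            System.out.println("Not encodable on this runtime: " + e);
        }
    }
}
```

If the catch branch runs, the [ -4, -4 ] seen earlier would be replacement bytes rather than a real mapping.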

The full-width Yen (U+FFE5) encodes to [ -127 (0x81), -113 (0x8F) ], and decodes correctly.

Strangely, if I try to decode [ 92 (0x5C) ], which is what I believe the Shift-JIS encoding of the single-width Yen to be, the Android/Java decoder produces a backslash, leaving the character value as 92.

If the encoder didn't support a given character, I would expect a replacement character such as '?'. But -4 (0xFC) doesn't even seem to be valid Shift-JIS. It's not even the Unicode replacement character U+FFFD. Using the following line I can see that the encoder seems to be configured to use [-4, -4]:

Charset.forName("Shift_JIS").newEncoder().replacement()
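A slightly fuller probe along the same lines might print the replacement bytes as unsigned hex and ask the encoder directly whether it has a mapping for each Yen variant. This is only a sketch: EncoderProbe, replacementHex and canEncode are hypothetical names, and the values printed for Shift_JIS depend on the runtime, so none are stated here.

```java
import java.nio.charset.Charset;

final class EncoderProbe {
    // Hypothetical helper: format the encoder's replacement bytes as unsigned hex.
    static String replacementHex(Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : charset.newEncoder().replacement()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // True if the charset's encoder reports a mapping for the given character.
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        System.out.println("replacement: " + replacementHex(sjis));
        System.out.println("canEncode(U+00A5): " + canEncode(sjis, '\u00a5'));
        System.out.println("canEncode(U+FFE5): " + canEncode(sjis, '\uffe5'));
    }
}
```
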
  • So why isn't the single width Yen mapped in Shift-JIS?
  • Is [-4, -4] a sensible encoder replacement?
  • Why doesn't the decoder support 0x5C mapping to Yen (U+00A5)?
  • If 0x5C is not the correct encoding, what is?
StephenD

1 Answer


A partial answer: back when Microsoft created its East Asian code pages for Windows, such as the Japanese code page 932 and the Korean code page 949, they made the byte 0x5C render as a currency symbol (a Yen sign or a Won sign, respectively) while still acting syntactically as a backslash in file paths. A file path on a Japanese system might therefore look like this:

C:¥Documents¥something.doc

Thus the byte was in one sense a Yen sign and in another sense a backslash. On a Japanese system, the same byte could even be rendered as either symbol depending on the font, according to http://archives.miloush.net/michkap/archive/2005/09/17/469941.html.

Because the symbol has no consistent meaning within the encoding, a Shift-JIS encoder can sensibly map both \ and ¥ to the byte 0x5C, but a decoder converting a Shift-JIS-encoded string to a sequence of Unicode code points has no way of knowing whether the byte 0x5C should become a backslash or a Yen sign. Japanese users used to make that choice via their font selection (if they were able to make it at all).
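If the goal is simply to have a Yen sign come out as whatever byte the backslash encodes to, one workaround that follows from this reasoning is to fold U+00A5 into U+005C before encoding. The sketch below illustrates that idea only; it is not a description of what Java or Android do internally, YenFold, foldYen and encodeFolded are hypothetical names, and it assumes the runtime's Shift_JIS encoder maps a backslash to 0x5C, which is worth checking from the printed output.

```java
import java.nio.charset.Charset;

final class YenFold {
    // Hypothetical helper: treat the Yen sign as a backslash, as 0x5C historically was.
    static String foldYen(String text) {
        return text.replace('\u00a5', '\\');
    }

    // Encode after folding, so U+00A5 never reaches the encoder.
    static byte[] encodeFolded(String text, Charset charset) {
        return foldYen(text).getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeFolded("\u00a5100", Charset.forName("Shift_JIS"));
        for (byte b : bytes) {
            System.out.printf("0x%02X%n", b & 0xFF); // print each byte as unsigned hex
        }
    }
}
```

The folding is lossy: after a round trip, a Yen sign and a backslash can no longer be told apart, which is the same ambiguity the encoding itself has.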

In the face of this unfixable ambiguity, all decoders seem to choose to decode 0x5C to a backslash. (At least, Python does this, and the WHATWG has a spec that dictates it.)
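To check which choice a particular Java or Android runtime makes, the byte can be decoded with error reporting switched on and the resulting code point printed. This is a minimal sketch; DecodeProbe and decodeStrict are hypothetical names, and no output is claimed for Shift_JIS because it may vary by runtime.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

final class DecodeProbe {
    // Hypothetical helper: decode, throwing on malformed or unmappable input.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        try {
            String s = decodeStrict(new byte[] {0x5C}, Charset.forName("Shift_JIS"));
            // U+005C would be a backslash, U+00A5 a Yen sign.
            System.out.printf("U+%04X%n", (int) s.charAt(0));
        } catch (CharacterCodingException e) {
            System.out.println("Not decodable on this runtime: " + e);
        }
    }
}
```
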

As for the details of what Java/Android in particular are doing when asked to encode a Yen sign in shift_jis, I'm afraid I don't know.

Mark Amery