Unexpected conversion of Unicode overline (U+203E) to Shift-JIS

Question

For a customer project, a query is made against a DB and the results are written to a file. The file is required to be in Shift JIS as it is later used as input for another legacy system. The Wikipedia article indicates that:

The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at 0x7E in place of the ASCII character set's backslash and tilde respectively.

During some testing, I have verified that while the yen sign (U+00A5) properly becomes 0x5C, the overline (U+203E) becomes 0x3F (question mark) rather than the expected 0x7E.

While I am doing normal output with a StreamWriter to a file, below is minimal code to reproduce:

    // Requires: using System; using System.Text;
    static void Test()
    {
        // Get the Shift-JIS encoding.
        var encoding = Encoding.GetEncoding("shift_jis");

        // Declare overline (U+203E).
        char c = '\u203E';

        // Get bytes when encoded as Shift-JIS.
        var bytes = encoding.GetBytes(c.ToString());

        // Expected 0x7E, but the value returned is 0x3F.
        Console.WriteLine("0x{0:X2}", bytes[0]);
    }

Is this behavior correct? I suppose I could subclass EncoderFallback, but this seems like far more work for something that I would have expected to work from the start.

score 1 · Accepted Answer · answered Jan 09 '13 at 08:17

Upon further investigation, I must conclude that "Shift JIS" is a misnomer here: what .NET actually returns is code page 932. Both Unicode and Microsoft provide a mapping table between code page 932 and Unicode, and this is apparently what is being used to map the characters. Notice that it contains neither the pair (0x5C, U+00A5) nor the pair (0x7E, U+203E); in code page 932 the bytes 0x5C and 0x7E map to U+005C (backslash) and U+007E (tilde), exactly as in ASCII.

Note, though, that I wrote in the original question that "I have verified that while the yen sign (U+00A5) properly becomes 0x5C". Apparently, the Encoding.GetEncoding(String) method returns an encoding whose EncoderFallback is System.Text.InternalEncoderBestFitFallback, which I assume provides additional best-fit mappings for some characters that would otherwise fail. It must contain an additional mapping for yen (U+00A5), but unfortunately nothing for overline (U+203E). When I replace it with EncoderExceptionFallback, encoding fails for both characters.
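To make the difference between an exact mapping and a best-fit mapping visible, the encoding can be requested with exception fallbacks instead of the defaults. The following is a minimal sketch, not the code I originally ran: the class and method names (StrictShiftJis, HasExactMapping) are made up for illustration, and the RegisterProvider call assumes a modern .NET runtime where code page encodings are not available by default.

```csharp
using System;
using System.Text;

public static class StrictShiftJis
{
    // Returns true only if the character can be encoded without any
    // fallback, i.e. it has an exact entry in the "shift_jis"
    // (code page 932) mapping used by .NET.
    public static bool HasExactMapping(char c)
    {
        // Assumption: on .NET Core / .NET 5+ the code page encodings must be
        // registered first. On .NET Framework this line is not needed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Request the encoding with exception fallbacks so that unmappable
        // characters throw instead of being best-fitted or replaced by '?'.
        Encoding strict = Encoding.GetEncoding(
            "shift_jis",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());

        try
        {
            strict.GetBytes(c.ToString());
            return true;
        }
        catch (EncoderFallbackException)
        {
            // No exact mapping exists for this character.
            return false;
        }
    }
}
```

With this helper, HasExactMapping('\u203E') should report that the overline has no exact mapping, which matches the 0x3F result seen with the default fallback.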

Hence, I conclude that for true Shift JIS this would be an error, but for code page 932 it is the expected result.
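If the consuming system really expects JIS X 0201 semantics for 0x5C and 0x7E, one possible workaround that avoids subclassing EncoderFallback is to substitute the two characters before encoding. This is only a sketch under that assumption: the names LegacyShiftJisWriter and Encode are hypothetical, and the RegisterProvider call again assumes a modern .NET runtime.

```csharp
using System;
using System.Text;

public static class LegacyShiftJisWriter
{
    // Encodes text with code page 932 after mapping the yen sign and the
    // overline to the ASCII characters that occupy 0x5C and 0x7E.
    public static byte[] Encode(string text)
    {
        // Assumption: needed on .NET Core / .NET 5+, not on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp932 = Encoding.GetEncoding("shift_jis");

        // Code page 932 maps backslash to 0x5C and tilde to 0x7E, so the
        // substituted characters land on the bytes Shift JIS would use.
        string prepared = text
            .Replace('\u00A5', '\\')
            .Replace('\u203E', '~');

        return cp932.GetBytes(prepared);
    }
}
```

The same substitution can be applied to each string before it is passed to the StreamWriter. Note that any genuine backslash or tilde in the data also ends up as 0x5C or 0x7E, so the legacy system will display those as yen and overline.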
