
I'm trying to write a small decryption program, but I'm running into trouble. I XOR the characters with 'FF' (to invert all the bits) by converting the string to a byte array and applying the XOR to each byte. The characters are in Shift-JIS encoding, though, and something isn't working. The method seems to work with plain ASCII letters, but it goes wrong as soon as it reaches the Japanese characters.

public void sampleMethod(String a)
{
    try {
        String b = "FF";
        byte[] c = a.getBytes("Shift_JIS");
        byte[] d = b.getBytes("Shift_JIS");
        byte[] e = new byte[50];
        for (int i = 0; i < c.length; i++)
        {
            e[i] = (byte)(c[i] ^ d[i % 2]);
        }
        String t = new String(e, "Shift_JIS");
        System.out.println(t);
    }
    catch (UnsupportedEncodingException e)
    {
    }
}

But when I pass in Japanese characters, every single one of them comes out as 'y' ('yyyyyy'). I printed the byte array to find the problem, and each character was being stored as '63'. How do I get the characters stored correctly? More to the point, how do I apply XOR to Shift-JIS characters?

I'm using XOR because I just want to invert the bits, say from 0010 to 1101, and then turn the result back into characters. Is that possible?

Thanks

For example, with the input "始めまして" I get "yyyyy", and with "hello there" I get ".#**)f2.#4#".

Micki

1 Answer


You simply can't do this kind of byte-wise manipulation on multi-byte characters and still expect the result to decode as text.

Japanese characters (and other extended characters) are typically represented by a series of bytes. Changing these around is likely to produce invalid sequences which can't be decoded properly (and I guess this is the result that you are seeing).
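The figures in the question can also be checked by hand. The sketch below is only an illustration: the class `KeyCheck` and its method names are made up, and it is not the asker's program. It shows that the string "FF" encodes to two bytes of value 0x46 (the letter F) rather than to the single byte 0xFF, and that XOR-ing the ASCII sample from the question with that key gives the output reported there. It also shows that 63 is the code for '?', and that '?' XOR 'F' is 'y'. One possible reading, which I can't confirm from the question alone, is that the Japanese text had already been replaced by '?' before the XOR ever ran.

```java
import java.nio.charset.StandardCharsets;

class KeyCheck {
    // Bytes of the key string used in the question. 'F' has the same
    // single-byte code in ASCII and Shift_JIS, so ASCII is used here.
    static byte[] keyBytes() {
        return "FF".getBytes(StandardCharsets.US_ASCII);
    }

    // XOR an ASCII-only string with the key, byte by byte.
    static String xorWithKey(String asciiText) {
        byte[] in = asciiText.getBytes(StandardCharsets.US_ASCII);
        byte[] key = keyBytes();
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (byte) (in[i] ^ key[i % key.length]);
        }
        return new String(out, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] key = keyBytes();
        System.out.printf("key bytes: 0x%02X 0x%02X%n", key[0], key[1]);
        System.out.println(xorWithKey("hello there")); // the ASCII sample from the question
        System.out.println(xorWithKey("?"));           // byte 63 XOR 'F'
    }
}
```

If the goal is to invert every bit, the key has to be the byte value 0xFF, not the text "FF".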

From the Wikipedia article on Shift JIS, the encoding:

"only guarantees that the first byte will be high bit set (0x80–0xFF); the value of the second byte can be either high or low"

I would imagine by XOR'ing you are breaking this guarantee.
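To make that concrete, here is a minimal sketch. It assumes the JDK in use ships the Shift_JIS charset, and the class `LeadByteCheck` and its helpers are hypothetical names. It encodes the hiragana character U+3042, flips every bit, and checks the high bit of the first byte before and after.

```java
import java.nio.charset.Charset;

class LeadByteCheck {
    // Assumes the runtime provides the Shift_JIS charset.
    static final Charset SJIS = Charset.forName("Shift_JIS");

    // Flip every bit of every byte (the same as XOR with 0xFF).
    static byte[] flip(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (byte) ~in[i];
        }
        return out;
    }

    static boolean highBitSet(byte b) {
        return (b & 0x80) != 0;
    }

    public static void main(String[] args) {
        byte[] original = "\u3042".getBytes(SJIS); // hiragana A: 0x82 0xA0
        byte[] flipped = flip(original);           // 0x7D 0x5F
        System.out.println("lead byte high bit before: " + highBitSet(original[0]));
        System.out.println("lead byte high bit after:  " + highBitSet(flipped[0]));
    }
}
```

After the flip, the first byte no longer has its high bit set, so a decoder will not treat the two bytes as one double-byte character any more.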

If you want to invert the bits and later invert them back again, work with a byte[] internally and only turn it back into a String once you're sure the byte array is valid Shift JIS again.
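A rough sketch of that approach, assuming the Shift_JIS charset is available in your JDK. The class `XorRoundTrip` and the names `scramble` and `unscramble` are mine, purely for illustration. The scrambled data stays a byte[] the whole time and is only decoded after the second flip.

```java
import java.nio.charset.Charset;

class XorRoundTrip {
    // Assumes the runtime provides the Shift_JIS charset.
    static final Charset SJIS = Charset.forName("Shift_JIS");

    // XOR every byte with 0xFF, which inverts all eight bits.
    static byte[] flip(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (byte) (in[i] ^ 0xFF);
        }
        return out;
    }

    // Text -> Shift_JIS bytes -> flipped bytes. The result is NOT text.
    static byte[] scramble(String text) {
        return flip(text.getBytes(SJIS));
    }

    // Flipped bytes -> original Shift_JIS bytes -> text.
    static String unscramble(byte[] scrambled) {
        return new String(flip(scrambled), SJIS);
    }

    public static void main(String[] args) {
        // The Japanese sample from the question, written with Unicode escapes
        // so the source file encoding does not matter.
        String input = "\u59cb\u3081\u307e\u3057\u3066";
        byte[] scrambled = scramble(input);
        String restored = unscramble(scrambled);
        System.out.println(restored.equals(input));
    }
}
```

Note that the output array is sized from the input rather than fixed at 50, so no stray zero bytes end up in the decoded string.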

Jeff Foster
  • Yep that's basically true for any multi-byte encoding. Which also includes any kind of Unicode, so don't even try to get around this by converting it to a different encoding. Although I'm not actually sure WHAT this should accomplish anyhow - seems completely arbitrary. – Voo May 22 '11 at 22:57
  • I basically just want to reverse the bits, from say 001110010 to 110001101, then convert it back to characters. Would that be possible? – Micki May 22 '11 at 22:57
  • Yup, it should be possible to reverse the bits and go back, but make sure you work with byte arrays internally and only turn the result back into a string when it's definitely a Shift JIS string. – Jeff Foster May 22 '11 at 23:00
  • No. A simple example: at the moment (well, I hope so at least) Unicode code points are only defined up to 0x10FFFF, so if you take any character whose reversed value is above this limit, it'll be invalid. And that's before we talk about gaps between different code points, or code points that only work when combined with other code points (I think such things do exist, correct me if I'm wrong). So basically the whole enterprise is doomed from the start. – Voo May 22 '11 at 23:00
  • @Jeff Foster I interpret it that he does want to print the reversed bytes as characters. Obviously if you XOR the same pattern twice you get the input back and that'll work fine - but really what's the use in that? – Voo May 22 '11 at 23:01
  • @Voo Yes, I was interpreting it as a kind of encryption (byte[] array internally, convert to string once decrypted). You are correct that once you start flipping bits around, displaying the result as a string is doomed. – Jeff Foster May 22 '11 at 23:05
  • @Jeff Ok, that's possible. Although in that case we should really warn him about using a simple XOR pattern (and then reversing only) as encryption ;) Base64 would hide the text just as well from the untrained eye and would be trivial to display as ASCII. – Voo May 22 '11 at 23:16
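For the Base64 idea in the last comment, a minimal sketch using `java.util.Base64` from the standard library (Java 8 or later, so newer than this thread). The class `Base64Display` and the names `hide` and `reveal` are hypothetical, and the Shift_JIS charset is assumed to be available. Like the XOR, this only obscures the text and is not encryption.

```java
import java.nio.charset.Charset;
import java.util.Base64;

class Base64Display {
    // Assumes the runtime provides the Shift_JIS charset.
    static final Charset SJIS = Charset.forName("Shift_JIS");

    // Shift_JIS bytes -> Base64 text, which contains only ASCII characters.
    static String hide(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(SJIS));
    }

    // Base64 text -> Shift_JIS bytes -> original text.
    static String reveal(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), SJIS);
    }

    public static void main(String[] args) {
        String hidden = hide("\u3042\u3044"); // two hiragana characters
        System.out.println(hidden);           // safe to print on any console
        System.out.println(reveal(hidden).equals("\u3042\u3044"));
    }
}
```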