We are implementing a REST API that receives a JSON response and converts it to a JSON string. We then try to write this string to MapR FS by opening an output stream.

FileSystem mfsHandler;

...
...

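// create() returns an FSDataOutputStream; true means overwrite an existing file
FSDataOutputStream fsDataStream = mfsHandler.create(new Path("/demo/test.txt"), true);

String name = "Just to test";
byte[] namebytes = name.getBytes(); // uses the platform default charset
// fsDataStream.write(namebytes);
BufferedOutputStream bos = new BufferedOutputStream(fsDataStream);
bos.write(namebytes);
bos.flush(); // push the buffered bytes through to the underlying stream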

However, when the content is written, junk characters appear in front of it, which shifts the string to the right by what looks like 8 bits. The output is: ’^@^EJust to test

I tried following the post http://stackoverflow.com/questions/19687576/unwanted-chars-written-from-java-rest-api-to-hadoopdfs-using-fsdataoutputstream, but could not find a solution there.

How can I avoid these junk characters? Is there an alternative that avoids the 8-bit right shift?

lovely

1 Answer

The problem here has to do with the encoding of a Java string. You can select which encoding you want to use when you call getBytes.

For example, here is a tiny program that prints out the bytes for three different encodings:

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public void testEncoding() throws UnsupportedEncodingException {
    String s = "Sample text üø 漢字";

    // Encode the same string three ways and dump the resulting bytes
    asHex(System.out, s, "UTF-8");
    asHex(System.out, s, "UTF-16");
    asHex(System.out, s, "SHIFT_JIS");
}

// Prints the bytes of msg in the named encoding as hex, 16 bytes per line
public void asHex(PrintStream out, String msg, String encoding) throws UnsupportedEncodingException {
    byte[] buf = msg.getBytes(encoding);
    out.printf("\n\n%s - %s\n", msg, encoding);
    for (int i = 0; i < buf.length; i++) {
        byte b = buf[i];
        // Mask with 0xff so negative byte values print as two hex digits
        out.printf("%02x ", b & 0xff);
        if (i % 16 == 15) {
            out.printf("\n");
        }
    }
    out.printf("\n");
}

Here is the output:

Sample text üø 漢字 - UTF-8
53 61 6d 70 6c 65 20 74 65 78 74 20 c3 bc c3 b8 
20 e6 bc a2 e5 ad 97 


Sample text üø 漢字 - UTF-16
fe ff 00 53 00 61 00 6d 00 70 00 6c 00 65 00 20 
00 74 00 65 00 78 00 74 00 20 00 fc 00 f8 00 20 
6f 22 5b 57 


Sample text üø 漢字 - SHIFT_JIS
53 61 6d 70 6c 65 20 74 65 78 74 20 3f 3f 20 8a 
bf 8e 9a 

If you call getBytes() without specifying a character set, you get whatever the platform's default character set happens to be. That default varies from one environment to another, so it is almost always better to specify the one you want.
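As a minimal sketch of that advice, the snippet below encodes the string with an explicit charset before writing it through a BufferedOutputStream. The class and method names (ExplicitCharsetWrite, writeUtf8, encodeToBuffer) are made up for illustration, and a ByteArrayOutputStream stands in for the stream returned by mfsHandler.create(...) so that the example runs without a cluster; UTF-8 is an assumption about what the file should contain.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetWrite {

    // Encodes text as UTF-8 (an assumed choice) and writes it to target.
    // try-with-resources flushes and closes the buffer, and with it target.
    static void writeUtf8(OutputStream target, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        try (BufferedOutputStream bos = new BufferedOutputStream(target)) {
            bos.write(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: writes into an in-memory sink that stands in
    // for the file system stream, then returns the bytes that arrived.
    static byte[] encodeToBuffer(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeUtf8(sink, text);
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] written = encodeToBuffer("Just to test");
        // Dump the bytes as hex to check that nothing precedes the text
        for (byte b : written) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
    }
}
```

For ASCII-only text such as this, UTF-8 produces one byte per character, so any extra bytes at the front of the real file would have to come from somewhere other than this encoding step.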

Ted Dunning