Non Printable characters of UTF-8 - SUSE Linux Java doesn't support

Question

We are implementing a feature to support non-printable UTF-8 characters in our database. Our system stores them in the database and retrieves them. We collect input as Base64, convert it to a byte array, and store that in the database. During retrieval, the database gives us the byte array and we convert it back to Base64.

During retrieval (after the database gives us the byte array), all the attributes are converted to string arrays, later converted back to byte arrays, and then encoded as Base64 again to return to the user.

The code below compiles and works properly with our Windows JDK (Java 8). But when it is run in the SUSE Linux environment, we see strange characters.

public class Tewst {
    public static void main(String[] args) {
        byte[] attributeValues;
        String utfString ;

        attributeValues = new byte[]{-86, -70, -54, -38, -6};
        if (attributeValues != null) {
            utfString = new String(attributeValues);
            System.out.println("The string is "+utfString);
        }
    }
}

The output given is

"The string is ªºÊÚú"

Now, when the same file is run on a SUSE Linux distribution, it gives me:

"The string is ��"

We are using Java 8 on both Windows and Linux. Why doesn't it execute properly on Linux?

We have also tried utfString = new String(attributeValues, "UTF-8"); but it didn't help. What are we missing?

I'd guess, this has nothing to do with the Java program but with the font you are using to print the string. — Henry, Jun 09 '17 at 05:16
If that is so, when the string is converted back to byte array , it ought to give the original byte array, but it is giving something else and not the original byte array. We are getting 15 values instead of 5 values in the byte array. — javaShilp, Jun 09 '17 at 05:18
"The below piece of code compiles and works properly in our Windows JDK (Java 8 version)" seems unlikely - if that's meant to be UTF-8-encoded text, and you're using the default encoding (*never* do that implicitly) then you won't be getting the right results. — Jon Skeet, Jun 09 '17 at 05:48

score 1 · Answer 1 · answered Jun 09 '17 at 05:36

The characters ªºÊÚú are Unicode 00AA 00BA 00CA 00DA 00FA.

In character set ISO-8859-1, that is bytes AA BA CA DA FA.
In decimal, that would be {-86, -70, -54, -38, -6}, as you have in your code.

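So your bytes are encoded in ISO-8859-1, not UTF-8. That is also why it doesn't work on Linux: new String(byte[]) uses the JVM's default character set, which on Linux is typically UTF-8, while on Windows it is typically windows-1252, which maps these five bytes to the same characters as ISO-8859-1.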
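To see why decoding these bytes as UTF-8 fails, here is a minimal sketch. The class and method names (Utf8DecodeDemo, roundTripUtf8) are made up for illustration and are not from the question. None of the five bytes forms a valid UTF-8 sequence, so each one is replaced by U+FFFD, and U+FFFD takes three bytes when encoded back to UTF-8. That would account for the 15 bytes mentioned in the comments.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original code.
public class Utf8DecodeDemo {
    // The five bytes from the question: AA BA CA DA FA.
    static final byte[] INPUT = {-86, -70, -54, -38, -6};

    // Decode as UTF-8, then encode the resulting string back to UTF-8.
    static byte[] roundTripUtf8(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = new String(INPUT, StandardCharsets.UTF_8);
        // Every invalid byte becomes the replacement character U+FFFD.
        System.out.println("chars after decoding: " + decoded.length());
        // Each U+FFFD is encoded as EF BF BD, so the array grows.
        System.out.println("bytes after re-encoding: " + roundTripUtf8(INPUT).length);
    }
}
```

The original byte values are lost at the decoding step, so no later conversion can recover them.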

Never use new String(byte[]), unless you're absolutely sure you want the default character set of the JVM, whatever that might be.

Change code to new String(attributeValues, StandardCharsets.ISO_8859_1).
And of course, in the reverse operation, use str.getBytes(StandardCharsets.ISO_8859_1).
Then it should work consistently across platforms, since the code is no longer using platform defaults.
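A minimal sketch of the round trip with an explicit charset, assuming the same five bytes as in the question. The class and helper names (Latin1RoundTrip, toText, toBytes) are hypothetical, and the Base64 step only illustrates the final conversion the question describes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class, not part of the original code.
public class Latin1RoundTrip {
    // Decode bytes with an explicit charset instead of the platform default.
    static String toText(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Encode with the same charset to get the original bytes back.
    static byte[] toBytes(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] original = {-86, -70, -54, -38, -6};
        String text = toText(original);
        byte[] restored = toBytes(text);

        // ISO-8859-1 maps each byte to exactly one character, so nothing is lost.
        System.out.println("round trip intact: " + Arrays.equals(original, restored));
        // The restored bytes can then be Base64-encoded for the caller.
        System.out.println("base64: " + Base64.getEncoder().encodeToString(restored));
    }
}
```

Because ISO-8859-1 assigns a character to every byte value from 0 to 255, this round trip preserves any byte array, independent of the operating system's default encoding.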

It worked!!! :) Thank you very much. I think we are under the impression that we are using UTF-8 encoding and just realized that our encoding is not UTF-8. Thanks Andreas. — javaShilp, Jun 09 '17 at 05:52
