How to Properly Read Unicode from InputStream?

Question

I found others having the same issue and their problems were solved by specifying UTF-8 in the InputStreamReader constructor:

Reading InputStream as UTF-8

https://www.mkyong.com/java/how-to-read-utf-8-encoded-data-from-a-file-java/

This is not working for me and I don't know why. No matter what I try, I keep getting the escaped Unicode values (a backslash, a "u", and four hexadecimal digits) instead of the actual language characters. What am I doing wrong here? Thanks in advance!

// The InputStream "is" is a FileInputStream:
public void load(InputStream is) throws Exception {

    BufferedReader br = null;

    try {
        // Passing "UTF8" or "UTF-8" to this constructor makes no difference for me:
        br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        String line = null;         
        while ((line = br.readLine()) != null) {
            // The following prints "got line: chinese = \u4f60\u597d" instead of "got line: chinese = 你好"
            System.out.println("got line: " + line);
        }
    } finally {
        if (br != null) {
            br.close();
        }
    }       
}

Please note: This is not a font issue. I know this because if I use a ResourceBundle against the same file, the Chinese characters print correctly in the IDE console. But whenever I try to read the file manually using the FileInputStream, something keeps converting the characters into the backslash-u convention, even though I'm telling it to use UTF-8 encoding. I also tried tinkering with the project's encoding JVM parameters, but still no joy. Thanks again for any advice.

Also, using the ResourceBundle as a final solution is not an option for me. There are legitimate reasons for this particular project why it's not quite the right tool for the job, and why I'm trying to do this explicitly myself.

EDIT: I tried pulling the bytes from the InputStream manually, completely bypassing the InputStreamReader and its constructor, which seems to be ignoring my encoding parameters. This results in the same behavior: the backslash-u convention instead of the correct characters. It's hard to understand why I can't get this to work the same way it works for seemingly everyone else. Do I maybe have a system/OS setting somewhere that's overriding Java's ability to properly handle Unicode? I'm using Java version 1.8.0_65 (64-bit) on Windows 7 version 6.1 (also 64-bit).

public void load(InputStream is) throws Exception {     
    String line = null;     
    try {
        while ((line = readLine(is)) != null) {
            // The following prints "got line: chinese = \u4f60\u597d" instead of "got line: chinese = 你好"
            System.out.println("got line: " + line);                
        }           
    } finally {
        is.close();
    }       
}

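// Needs java.io.ByteArrayOutputStream, java.io.IOException and java.nio.charset.StandardCharsets.
// Reads bytes up to the next '\n' (or end of stream) and decodes them as UTF-8.
// Returns null once the stream is exhausted.
private String readLine(InputStream is) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    int b;
    // Keep the value as an int: casting to byte first would make the byte 0xFF
    // indistinguishable from the -1 end-of-stream marker.
    while ((b = is.read()) != -1) {
        if (b == '\n') {
            // An empty line yields "", so it no longer ends the caller's loop early.
            return buffer.toString(StandardCharsets.UTF_8.name());
        }
        buffer.write(b);
    }
    // End of stream: return any trailing text, or null if nothing is left.
    if (buffer.size() == 0) {
        return null;
    }
    // Decode explicitly as UTF-8 instead of relying on the platform default charset.
    return buffer.toString(StandardCharsets.UTF_8.name());
}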

score 0 · Answer 1 · answered Sep 13 '17 at 14:56

In case anyone else out there runs into the same trouble, I was able to find a solution. Since the ResourceBundle was always doing the right thing for me, I dug into why that is and found that java.util.Properties does all the magic in its loadConvert() function. The file itself contains the literal escape sequences (a backslash, a "u", and four hex digits), which is why the encoding passed to the reader made no difference. After the BufferedReader gives me a line of text from the file, I need to explicitly decode the Unicode-escaped characters in that String, roughly like this:

public void load(InputStream is) throws Exception {

    BufferedReader br = null;

    try {
        // Passing "UTF8" or "UTF-8" to this constructor makes no difference for me:
        br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        String line = null;         
        while ((line = br.readLine()) != null) {
            // The following prints "got line: chinese = \u4f60\u597d" instead of "got line: chinese = 你好"
            System.out.println("got line: " + line);
            line = decodeUni(line);
            // The following prints "decoded line: chinese = 你好" exactly as it should!
            System.out.println("decoded line: " + line);
        }
    } finally {
        if (br != null) {
            br.close();
        }
    }       
}

// Converts literal "\\uxxxx" escape sequences (plus a few other escapes) to the characters they denote
private String decodeUni(String string) {

    char[] charsIn = string.toCharArray();
    int end = charsIn.length;
    // The output is never longer than the input, because every escape shrinks.
    char[] charsOut = new char[end];
    char ch;
    int outLen = 0;
    int off = 0;

    while (off < end) {
        ch = charsIn[off++];
        // Is this the start of an escape? A lone trailing backslash is copied as-is.
        if (ch == '\\' && off < end) {
            ch = charsIn[off++];
            if (ch == 'u') {
                // Unicode escape: the next four characters must be hex digits.
                if (end - off < 4) {
                    throw new IllegalArgumentException("Malformed \\uxxxx encoding: " + string);
                }
                int value = 0;
                for (int i = 0; i < 4; i++) {
                    ch = charsIn[off++];
                    switch (ch) {
                        case '0': case '1': case '2': case '3': case '4':
                        case '5': case '6': case '7': case '8': case '9': {
                            value = (value << 4) + ch - '0';
                            break;
                        }
                        case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': {
                            value = (value << 4) + 10 + ch - 'a';
                            break;
                        }
                        case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': {
                            value = (value << 4) + 10 + ch - 'A';
                            break;
                        }
                        default: throw new IllegalArgumentException("Malformed \\uxxxx encoding: " + string);
                    }
                }
                charsOut[outLen++] = (char)value;
            } else {
                // A backslash followed by something other than 'u': handle the other escaped characters.
                switch (ch) {
                    case 't':
                        ch = '\t';
                        break;
                    case 'r':
                        ch = '\r';
                        break;
                    case 'n':
                        ch = '\n';
                        break;
                    case 'f':
                        ch = '\f';
                        break;
                    default:
                        // Unknown escape: drop the backslash and keep the character.
                        break;
                }
                charsOut[outLen++] = ch;
            }
        } else {
            // Not an escape, leave as-is.
            charsOut[outLen++] = ch;
        }
    }
    // Note: trim() also strips leading or trailing whitespace that decoding produced.
    return new String(charsOut, 0, outLen).trim();
}
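
If only the Unicode escapes matter, the same decoding step can probably be written more compactly with a regular expression. The following is a minimal sketch rather than a drop-in replacement: the class name UnicodeEscapes and the method name decode are made up for illustration, and it deliberately ignores the other escapes (tab, newline and so on) as well as the Properties rule that a doubled backslash stands for a literal backslash. It uses StringBuffer so that it still compiles on Java 8.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
public class UnicodeEscapes {

    // Matches a literal backslash, a 'u', and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces every matched escape sequence with the UTF-16 code unit it names.
    public static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer(input.length());
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement stops a decoded '$' or backslash from being
            // treated as a group reference by appendReplacement.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The argument holds the escape sequences as plain ASCII text,
        // just like a line read from the file in the question.
        System.out.println(decode("chinese = \\u4f60\\u597d"));
    }
}
```

Unlike decodeUni above, this version leaves malformed sequences (for example a backslash-u followed by fewer than four hex digits) untouched instead of throwing, and it does not trim the result.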
