I found others who had the same issue, and their problems were solved by specifying UTF-8 in the InputStreamReader constructor:
https://www.mkyong.com/java/how-to-read-utf-8-encoded-data-from-a-file-java/
This is not working for me, and I don't know why. No matter what I try, I keep getting the escaped Unicode values (backslash-u followed by four hex digits) instead of the actual language characters. What am I doing wrong here? Thanks in advance!
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// The InputStream argument is a FileInputStream over the file being read:
public void load(InputStream is) throws Exception {
    BufferedReader br = null;
    try {
        // Passing "UTF8", "UTF-8", or StandardCharsets.UTF_8 to this
        // constructor makes no difference for me:
        br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        String line;
        while ((line = br.readLine()) != null) {
            // The following prints "got line: chinese = \u4f60\u597d"
            // instead of "got line: chinese = 你好":
            System.out.println("got line: " + line);
        }
    } finally {
        if (br != null) {
            br.close();
        }
    }
}
Please note: This is not a font issue. I know this because if I use a ResourceBundle against the same file, the Chinese characters print correctly in the IDE console. But whenever I read the file manually using the FileInputStream, something keeps converting the characters into the backslash-u convention, even though I'm telling it to use UTF-8 encoding. I also tried tinkering with the project's encoding JVM parameters, but still no joy. Thanks again for any advice.
Also, using the ResourceBundle as a final solution is not an option for me. There are legitimate reasons for this particular project why it's not quite the right tool for the job, and why I'm trying to do this explicitly myself.
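For reference, this is roughly the ResourceBundle check I ran against the same file (the bundle base name "messages" here is just illustrative, not the real name):

import java.util.ResourceBundle;

// "messages" stands in for the real bundle base name, which resolves
// to the same .properties file I'm trying to read manually above:
ResourceBundle bundle = ResourceBundle.getBundle("messages");
// This correctly prints "got chinese: 你好" in the IDE console:
System.out.println("got chinese: " + bundle.getString("chinese"));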
EDIT: I tried pulling the bytes from the InputStream manually, completely bypassing the InputStreamReader and its constructor, which seems to be ignoring my encoding parameter. This results in the same behavior: the backslash-u convention instead of the correct characters. It's hard to understand why I can't get this to work the way it seems to work for everyone else. Do I maybe have a system/OS setting somewhere that's overriding Java's ability to handle Unicode properly? I'm using Java 1.8.0_65 (64-bit) on Windows 7 version 6.1 (also 64-bit).
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public void load(InputStream is) throws Exception {
    String line;
    try {
        while ((line = readLine(is)) != null) {
            // Still prints "got line: chinese = \u4f60\u597d"
            // instead of "got line: chinese = 你好":
            System.out.println("got line: " + line);
        }
    } finally {
        is.close();
    }
}

private String readLine(InputStream is) throws IOException {
    List<Byte> bytesList = new ArrayList<>();
    while (true) {
        // Read into an int: InputStream.read() signals end-of-stream by
        // returning -1, which a byte cast would confuse with 0xFF. It
        // does not throw EOFException, so no catch block is needed.
        int b = is.read();
        if (b == -1) {
            return bytesToString(bytesList);
        }
        if (b == '\r') {
            // Skip carriage returns so Windows CRLF line endings
            // don't leave a trailing \r on each line.
            continue;
        }
        if (b == '\n') {
            return bytesToString(bytesList);
        }
        bytesList.add((byte) b);
    }
}

private String bytesToString(List<Byte> bytesList) {
    if (bytesList.isEmpty()) {
        return null;
    }
    byte[] bytes = new byte[bytesList.size()];
    for (int i = 0; i < bytes.length; i++) {
        bytes[i] = bytesList.get(i);
    }
    // Decode explicitly as UTF-8; the single-argument String(byte[])
    // constructor would use the platform default charset instead.
    return new String(bytes, StandardCharsets.UTF_8);
}