FileWriter somehow writes in Chinese

Question

Please help me with this problem. I'm trying to write code that reads a .txt file and counts the frequency of each letter in the file. This is what I came up with:

public static void charCount(String file) throws IOException {
        FileReader fr = new FileReader(file);
        BufferedReader br = new BufferedReader(fr);

        int[] count = new int[26];
        String line;
        while ((line = br.readLine()) != null) {
            line = line.toUpperCase();
            char[] characters = line.toCharArray();
            for (int i = 0; i < line.length(); i++) {
                if ((characters[i] >='A') && (characters[i] <='Z')) {
                    count[characters[i] - 'A']++;
                }
            }
        }
        File file2 = new File("D:/Project/Aufgabe/Winter_2019/frequency.txt");
        file2.createNewFile();
        FileWriter fw = new FileWriter(file2);
        for (int i = 0; i < 26; i++) {
            fw.write(((char)(i + 'A')) + ": " + count[i]);
        }
        fw.close();
        br.close();
    }

When I print the result to the console with System.out.println(), I get these results:

A: 15
B: 4
C: 9
D: 10
E: 2
F: 1
G: 0
H: 3
I: 5
J: 6
K: 3
L: 0
M: 2
N: 7
O: 3
P: 1
Q: 1
R: 0
S: 4
T: 0
U: 2
V: 0
W: 5
X: 0
Y: 1
Z: 0

Which is what I want. But when I write it to a file, I get this in the .txt file:

㩁ㄠ䈵›䌴›䐹›〱㩅㈠㩆ㄠ㩇〠㩈㌠㩉㔠㩊㘠㩋㌠㩌〠㩍㈠㩎㜠㩏㌠㩐ㄠ㩑ㄠ㩒〠㩓㐠㩔〠㩕㈠㩖〠㩗㔠㩘〠㩙ㄠ㩚〠

I'm still new to Java, so any help would be much appreciated.

You seem to have a strange default charset. Try using a specific charset for file output, e.g `new FileWriter(file2, StandardCharsets.UTF_8)`. — aventurin, Oct 22 '19 at 18:58
The 'chinese' is often a sign that something intending to be 8-bit characters (say, ASCII) is being interpreted as 16-bit characters (likely UTF-16). — , Oct 22 '19 at 19:12
@aventurin That constructor requires Java 11 or higher; previous versions lack it. Might be OK here, but worth mentioning nonetheless. — Lothar, Oct 22 '19 at 19:59

Lothar · Accepted Answer · 2019-10-22T20:24:36.593

While a couple of things about your program could be improved, none of them is the reason why you see Chinese characters. In fact, your program seems to work just fine, and the resulting file actually contains the text you saw when trying it with System.out.println.

I copied your output example, pasted it into a new file using Notepad and, after saving, had a look at the file using a hex editor (here HxD). The hex data started like this: FF FE 41 3A 20 31 35 42... which "translates" to ÿþA: 15B.... That's exactly your expected result plus a BOM (byte order mark) that was created by Notepad while saving the file and is therefore not part of the original data.

So why do you see the strange result? The reason is not your program but the text viewer you're using. If a file lacks a BOM, many viewers make an educated guess about whether it should be read (in the case of Windows Notepad) as cp1252 (Windows Latin-1), UTF-8 or Unicode/UTF-16. There are different algorithms, so it's hard to say why your viewer decided that this might be UTF-16, but that's the way it is ;-)
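To see how such a misreading produces the characters from the question, here is a minimal sketch. It encodes the start of the expected output as one byte per character and then decodes the same bytes as little-endian UTF-16, which is an assumption about what the viewer guessed. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf16MisreadDemo {
    // Encodes text as one byte per character, then decodes the same bytes
    // as little-endian UTF-16, so every two bytes become one character.
    static String misreadAsUtf16(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // Start of the file content the program writes (no line breaks).
        String written = "A: 15B: 4C: 9D: 10";
        System.out.println(misreadAsUtf16(written));
    }
}
```

For these 18 bytes the returned string is the nine characters 㩁ㄠ䈵›䌴›䐹›〱, which is how the file in the question starts. Whether the console displays them correctly depends on its own encoding.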

My guess is that a fix for your problem might be to change

fw.write(((char)(i + 'A')) + ": " + count[i]);

to

fw.write(((char)(i + 'A')) + ": " + count[i] + "\r\n");
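Put together, the write loop with line breaks could look like the following sketch. It also uses try-with-resources so the writer is closed even if a write fails. The class name, the `formatCounts` and `writeCounts` helpers and the output path are hypothetical; only the `"\r\n"` suffix comes from the suggestion above.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

class FrequencyWriter {
    // Builds one "X: n" line per letter, each terminated by a Windows line break.
    static String formatCounts(int[] count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count.length; i++) {
            sb.append((char) (i + 'A')).append(": ").append(count[i]).append("\r\n");
        }
        return sb.toString();
    }

    // Writes the formatted counts to the given path (hypothetical helper).
    static void writeCounts(String path, int[] count) {
        try (Writer fw = new FileWriter(path)) {
            fw.write(formatCounts(count));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] count = new int[26];
        count[0] = 15; // sample values for A and B
        count[1] = 4;
        writeCounts("frequency.txt", count);
    }
}
```

With the line breaks in place the bytes no longer pair up the way they did in the original single-line file, which may be enough to change the viewer's guess. That is an assumption and not something verified here.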

Alternatively, write the file using an explicit charset, ideally one that includes a BOM such as UTF-16 (Java's UTF-8 charset does not write one, as the comment below notes). With Java 11 you can do that with FileWriter directly (there is a new constructor that accepts a charset); if you have to use an older version of Java, you need to use OutputStreamWriter:

OutputStreamWriter fw = new OutputStreamWriter(new FileOutputStream(file2), "UTF8");
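As a rough check of what each charset actually puts at the start of the output, the sketch below writes the same text through an OutputStreamWriter into memory and returns the bytes. The class and method names are hypothetical. `StandardCharsets` constants are used instead of a charset name string, so no UnsupportedEncodingException can occur.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomCheck {
    // Encodes text with the given charset via OutputStreamWriter and returns the raw bytes.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (Charset cs : new Charset[] {StandardCharsets.UTF_8, StandardCharsets.UTF_16}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encode("A: 15", cs)) {
                hex.append(String.format("%02X ", b));
            }
            System.out.println(cs + " -> " + hex.toString().trim());
        }
    }
}
```

This prints `UTF-8 -> 41 3A 20 31 35` and `UTF-16 -> FE FF 00 41 00 3A 00 20 00 31 00 35`: Java's UTF-16 starts with the BOM FE FF followed by big-endian code units, while its UTF-8 output has no BOM and, for plain ASCII text, is byte-for-byte what cp1252 would produce.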

Also, check whether the "Open File" dialog of your text viewer allows you to specify the charset explicitly. Notepad on a German Windows system calls the option "Codierung", and "ANSI" there means cp1252 (the charset your Java Virtual Machine should have used when using FileWriter without a specific charset).
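To check which charset your JVM picks when none is specified, something like the following sketch can help. `Charset.defaultCharset()` and `OutputStreamWriter.getEncoding()` are standard API; the class and method names are only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Name of the JVM's default charset; the value depends on the system and Java version.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    // Encoding name reported by a writer created without an explicit charset.
    // FileWriter extends OutputStreamWriter and falls back to the same default.
    static String writerEncoding() {
        OutputStreamWriter w = new OutputStreamWriter(new ByteArrayOutputStream());
        return w.getEncoding();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        System.out.println("Writer encoding: " + writerEncoding());
    }
}
```

Note that `getEncoding()` may report a historical name (for example `Cp1252` rather than `windows-1252`), and that from Java 18 on the default is UTF-8 unless it is overridden.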

[This Windows misfeature has its own wikipedia article.](https://en.wikipedia.org/wiki/Bush_hid_the_facts) Java 'Charset' UTF-8 (alias UTF8) does _not_ write BOM (in -8 it only flags Unicode, not really byte order); UTF-16 and X-UTF-16LE-BOM (unicode or UnicodeBig and UnicodeLittle) do, but UTF-16BE and UTF-16LE (Unicode{Big,Little}Unmarked) do not. — dave_thompson_085, Oct 24 '19 at 14:40

score 0 · Answer 2 · answered Oct 22 '19 at 19:42

Change this line

fw.write(((char)(i + 'A')) + ": " + count[i]);

to

fw.write(" "+((char)(i + 'A')) + ": " + count[i]);

rahulP

Can you explain why that should make a difference? Both variants will be compiled against `Writer.write(String)`. – Lothar Oct 22 '19 at 20:01
