Why is Java BufferedReader() not reading Arabic and Chinese characters correctly?

Question

I'm trying to read a file that contains English and Arabic characters on each line, and another file that contains English and Chinese characters on each line. However, the Arabic and Chinese characters are not displayed correctly; they just appear as question marks. Any idea how I can solve this problem?

Here is the code I use for reading:

    try {
        String sCurrentLine;
        BufferedReader br = new BufferedReader(new FileReader(directionOfTargetFile));
        int counter = 0;

        // Read the file line by line and print each line with its index
        while ((sCurrentLine = br.readLine()) != null) {
            String lineFixedHolder = converter.fixParsedParagraph(sCurrentLine);
            System.out.println("The line number "+ counter
                               + " contain : " + sCurrentLine);
            counter++;
        }
        br.close();
    } catch (IOException e) {
        e.printStackTrace();
    }

Edition 01

After reading the line and getting the Arabic or Chinese word, I use a function to translate it by searching for the given text in an ArrayList that contains all the expected words (using the indexOf() method). When the word's index is found, it is used to fetch the English word at the same index in another ArrayList. However, this search always fails, because it is looking for question marks instead of the Arabic and Chinese characters. So my System.out.println output shows nulls, one for each failed translation.

I'm using the NetBeans 6.8 IDE (Mac version).

Edition 02

Here is the code that searches for the translation:

        int testColor = dbColorArb.indexOf(wordToTranslate);
        int testBrand = -1;
        if ( testColor != -1 ) {
            String result = (String)dbColorEng.get(testColor);
            return result;
        } else {
            testBrand = dbBrandArb.indexOf(wordToTranslate);
        }
        //System.out.println ("The testBrand is : " + testBrand);
        if ( testBrand != -1 ) {
            String result = (String)dbBrandEng.get(testBrand);
            return result;
        } else {
            //System.out.println ("The first null");
            return null;
        }

I'm actually searching two ArrayLists which might contain the word to translate. If the word is not found in either ArrayList, null is returned.

Edition 03

When I debugged, I found that the lines being read are stored in my String variable as follows:

 "3;0000000000;0000001001;1996-06-22;;2010-01-27;����;;01989;������;"

Edition 04

The file I'm reading was given to me after being modified by another program (about which I know nothing except that it was written in VB). That program produced the Arabic letters that are not appearing correctly. When I checked the file's encoding in Notepad++, it showed ANSI. However, when I convert it to UTF-8 (which replaces the Arabic letters with other English ones) and then convert it back to ANSI, the Arabic becomes question marks!

You need to say what you are trying to output the characters to and what output character set / encoding it is configured for. – Stephen C, Feb 14 '10 at 07:31
How about giving us the code that searches the `ArrayList` instead of explaining it? – Bozho, Feb 14 '10 at 08:07
Put a breakpoint, launch in debug mode, and trace the execution of the program to see where exactly it differs from your expectations. – Bozho, Feb 14 '10 at 08:15
I did output each variable at each step. The problem I found is that the word I want to translate is received as >>>, not as Arabic or Chinese characters. – M. A. Kishawy, Feb 14 '10 at 08:18
Yes, but don't output it to the console; look at the value in the debugger. The console involves an additional IO operation which may tamper with the encoding. – Bozho, Feb 14 '10 at 08:32
I didn't know that! However, the result shows > marks. Check the resulting line in my question. Any idea what is going on? – M. A. Kishawy, Feb 14 '10 at 08:39
Then the question is what the encoding of the file you are reading is. Is it UTF-8? – Bozho, Feb 14 '10 at 09:03
For example, download Notepad++ and see what it says. Btw, did you set the `-Dfile.encoding=UTF-8` VM arg, as I noted in my answer? – Bozho, Feb 14 '10 at 09:20
ANSI? What the...! Notepad++ shows that it's ANSI! Well, when I put ANSI as the encoding I get "java.io.UnsupportedEncodingException: ANSI" – M. A. Kishawy, Feb 14 '10 at 09:46
It's called `ISO-8859-1`. But it cannot contain Arabic symbols. I repeat my question about the VM arg. – Bozho, Feb 14 '10 at 09:48
Yes, I changed the VM arg. It didn't help. I got some updates regarding the file. Please check the question, Update 04. – M. A. Kishawy, Feb 14 '10 at 09:56
“ANSI” is used to mean the system Windows code page of the Windows installation it's running on. On Western installations that's code page 1252, which is similar to ISO-8859-1. But that code page definitely cannot include Chinese or Arabic. Actually I can't think of a code page used as default system code page in any region that allows both Chinese *and* Arabic. – bobince, Feb 14 '10 at 10:07
Please download the file from: http://www.4shared.com/file/221853641/3fa1af8c/data.html – M. A. Kishawy, Feb 14 '10 at 10:16
I don't see any Arabic chars in there, so I assume it is corrupt (and that is logical, since it's ANSI). In which case, ask for a UTF-8 file. – Bozho, Feb 14 '10 at 10:20
I downloaded it; on Windows it shows fine, but on Mac it fails to show! It's not corrupted, but there is something weird in it. It drives me crazy. I can't ask for any other type of file, because this is generated by third-party software and I have to modify it. – M. A. Kishawy, Feb 14 '10 at 10:22
Please check this link to see how it looks in Notepad++ on Windows: http://www.4shared.com/file/221862075/e8705951/text-Windows.html – M. A. Kishawy, Feb 14 '10 at 10:39
Please check this link to see how it looks in TextEdit on Mac: http://www.4shared.com/file/221863564/381bfd08/text-Mac.html – M. A. Kishawy, Feb 14 '10 at 10:43
Drop the Hungarian prefixes on your variables. Given that Java requires every variable to declare its type, they're really not needed. After all, currentLine has to be a String; what else could it be? – mP., Feb 14 '10 at 12:28

Bozho · Accepted Answer · 2010-02-14T12:32:30.560

24

FileReader javadoc:

Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.

So:

Reader reader = new InputStreamReader(new FileInputStream(fileName), "utf-8");
BufferedReader br = new BufferedReader(reader);

If this still doesn't work, then perhaps your console is not set to properly display UTF-8 characters. Configuration depends on the IDE used and is rather simple.

Update: In the code above, replace utf-8 with cp1256. This works fine for me (WinXP, JDK 6).

But I'd recommend that you insist on the file being generated using UTF-8, because cp1256 won't work for Chinese and you'll have similar problems again.
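To make the charset point concrete, here is a minimal, self-contained sketch of reading a file with an explicitly named encoding. The class name `CharsetAwareReader`, the method `readLines`, and the sample text are made up for illustration and are not from the question; the sketch also assumes the running JVM ships the `windows-1256` charset, which standard JDK builds normally do.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class CharsetAwareReader {

    // Reads every line of the file, decoding its bytes with the named charset
    // instead of relying on the platform default (hypothetical helper).
    public static List<String> readLines(Path file, String charsetName) throws IOException {
        Charset charset = Charset.forName(charsetName);
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Build a small sample file encoded as windows-1256 (illustrative data only).
        Path sample = Files.createTempFile("arabic-sample", ".txt");
        String original = "red;\u0623\u062D\u0645\u0631"; // "red;" followed by an Arabic word
        Files.write(sample, original.getBytes(Charset.forName("windows-1256")));

        // Decoding with the same charset should give back the original text.
        List<String> decoded = readLines(sample, "windows-1256");
        System.out.println(decoded.get(0).equals(original));

        Files.delete(sample);
    }
}
```

Passing a different name such as `"UTF-8"` to the same helper decodes the identical bytes differently, which is how the wrong assumption about a file's encoding turns valid Arabic into replacement characters.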

edited Feb 14 '10 at 12:32

answered Feb 14 '10 at 06:39

Bozho


I'm getting an error: "incompatible types - required: java.io.FileReader, found: java.io.InputStreamReader" – M. A. Kishawy Feb 14 '10 at 06:47
where are you getting that? just copy the two lines from my updated answer – Bozho Feb 14 '10 at 06:56
ok it's executing...however I still have the same problem of characters showing as > – M. A. Kishawy Feb 14 '10 at 07:18
then check the other part of my answer. and tell me what IDE you are using (if you are using one) – Bozho Feb 14 '10 at 07:28
I'm using Netbeans 6.8 Mac version IDE – M. A. Kishawy Feb 14 '10 at 07:45
I'm sorry for the confusion...I'm not using the String comparing, I'm actually searching it in Arraylist. – M. A. Kishawy Feb 14 '10 at 07:58
I double checked the file, it's not corrupted...I mean it's showing fine on most of the PCs I tried it on. Please check the links for the photos that shows the data. Mac: http://www.4shared.com/file/221863564/381bfd08/text-Mac.html PC: http://www.4shared.com/file/221862075/e8705951/text-Windows.html – M. A. Kishawy Feb 14 '10 at 10:47
Thanks, Bozho, for your incredible effort to assist me with this question. The problem is finally solved :) – M. A. Kishawy Feb 14 '10 at 12:56
@Bozho Thanks a million. `If this still doesn't work, then perhaps your console is not set to properly display UTF-8 characters. Configuration depends on the IDE used and is rather simple.` That was the problem. – Muhammad Babar Aug 17 '14 at 13:48

score 2 · Answer 2 · answered Feb 14 '10 at 06:33

It is most likely reading the information in correctly; however, your output stream is probably not UTF-8, so any character that cannot be shown in your output character set is being replaced with '?'.

You can confirm this by getting each character out and printing the character ordinal.
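A minimal sketch of that check follows; the class name `CodePointDump` and the method `dump` are invented for illustration. It formats each character of a string as its Unicode code point, so the result can be inspected even on a console that cannot display the characters themselves.

```java
public class CodePointDump {

    // Formats each Unicode code point of the text as U+XXXX, separated by spaces.
    public static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // A Latin letter followed by an Arabic letter, written as an escape
        // so the source file's own encoding does not matter.
        System.out.println(dump("a\u0628")); // prints "U+0061 U+0628"
    }
}
```

If the dump of a line read from the file shows values in the Arabic block (U+0600 to U+06FF), the text was decoded correctly and only the display is at fault. If it shows U+003F (a literal question mark) or U+FFFD (the replacement character), the characters were already lost when the bytes were decoded.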

score 0 · Answer 3 · answered Oct 12 '10 at 10:01

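import java.io.FileOutputStream;
import java.io.IOException;

public void writeTiFile(String fileName, String str) {
    // try-with-resources closes the stream even if the write fails
    try (FileOutputStream out = new FileOutputStream(fileName)) {
        // Encode the text as windows-1256 (Arabic) before writing the bytes
        out.write(str.getBytes("windows-1256"));
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}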

Ahmad Alhaj Hussein
