Problem: Arabic words in my text files, when read by Java, show up as a series of question marks: ??????

Here is the code:

        File[] fileList = mainFolder.listFiles();
        BufferedReader bufferReader = null;
        Reader reader = null;


        try{

        for(File f : fileList){           
            reader = new InputStreamReader(new FileInputStream(f.getPath()), "UTF8");
            bufferReader = new BufferedReader(reader);
            String line = null;

            while((line = bufferReader.readLine())!= null){
               System.out.println(new String(line.getBytes(), "UTF-8"));
            }              

        }
        }
        catch(Exception exc){
            exc.printStackTrace();
        }

        finally {
            //Close the BufferedReader
            try {
                if (bufferReader != null)
                    bufferReader.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }

As you can see, I have specified UTF-8 encoding in several places and I still get question marks. Do you have any idea how I can fix this?

Thanks

BalusC
Saher Ahwal

2 Answers

Instead of trying to print out the line directly, print out the Unicode values of each character. For example:

char[] chars = line.toCharArray();
for (int i = 0; i < chars.length; i++)
{
    System.out.println(i + ": " + chars[i] + " - " + (int) chars[i]);
}

Then look up the relevant characters in the Unicode code charts.

If you find it's printing 63, then those really are question marks... which would suggest that your text file isn't truly UTF-8 to start with.

If, on the other hand, some characters print as "?" but with a value other than 63, that suggests a console display issue, and you are reading the data correctly.
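
The same check can be written so that it prints the code points in the `U+XXXX` form used by the Unicode charts. This is only a minimal sketch: the class name `CodePointDump` and its two helper methods are made up for illustration, and it assumes Java 8 or later for `String#codePoints()`.

```java
// Hypothetical helper, not part of the original answer.
class CodePointDump {

    // Returns the code points of a line as "U+0645 U+003F ...".
    static String describe(String line) {
        StringBuilder sb = new StringBuilder();
        line.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // True if the string really contains '?' (U+003F, decimal 63),
    // which means the data was already lost before it reached the console.
    static boolean hasLiteralQuestionMarks(String line) {
        return line.indexOf('?') >= 0;
    }

    public static void main(String[] args) {
        // Arabic letter MEEM followed by a genuine question mark.
        String sample = "\u0645?";
        System.out.println(describe(sample));
        System.out.println(hasLiteralQuestionMarks(sample));
    }
}
```

Any value in the Arabic block (U+0600 to U+06FF) means the file was decoded correctly and only the display is at fault.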

Jon Skeet

Replace

System.out.println(new String(line.getBytes(), "UTF-8"));

by

System.out.println(line);

Calling String#getBytes() without a charset argument uses the platform default encoding to get the bytes from the string, and that is not necessarily UTF-8. You're already decoding the bytes as UTF-8 with InputStreamReader, so there is no need to convert back and forth afterwards.
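
To see why the round trip is harmful, here is a minimal sketch that stands in for the platform default with an explicit charset. The class `RoundTripDemo` and the method `roundTrip` are hypothetical names, and US-ASCII is only an assumed example of a default charset that cannot represent Arabic; the actual default on the asker's machine is not known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not part of the original answer.
class RoundTripDemo {

    // Mimics new String(line.getBytes(), "UTF-8"), with the platform
    // default charset passed in explicitly so the effect is reproducible.
    static String roundTrip(String text, Charset platformDefault) {
        byte[] bytes = text.getBytes(platformDefault); // unmappable chars become '?'
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627"; // five Arabic letters

        // A default charset without Arabic destroys the text: prints ?????
        System.out.println(roundTrip(arabic, StandardCharsets.US_ASCII));

        // Only when the default happens to be UTF-8 does the text survive.
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic));
    }
}
```

Printing `line` directly avoids the lossy `getBytes()` step altogether.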

Further, ensure that your display console (where you're reading those lines) supports UTF-8. In Eclipse, for example, you can set this via Window > Preferences > General > Workspace > Text File Encoding > Other > UTF-8.
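
Independently of the IDE setting, the Java side can be told to encode its output as UTF-8 by wrapping the output stream in a `PrintStream` with an explicit encoding. This is a sketch under the assumption that the console itself interprets incoming bytes as UTF-8; `Utf8Console` and `utf8Stream` are hypothetical names.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not part of the original answer.
class Utf8Console {

    // Wraps a stream so that everything printed to it is encoded as UTF-8,
    // regardless of the platform default encoding.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform is required to support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        out.println("\u0645\u0631\u062D\u0628\u0627"); // Arabic sample text
    }
}
```

If the console is not set to UTF-8, this output will still look garbled, so both sides have to agree.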

BalusC
  • I tried this and it still displayed ????. How do I make sure the console supports UTF-8? I'm pretty sure it does, since I was able to print some Arabic text on it from a MySQL database in another project. – Saher Ahwal Dec 22 '10 at 23:20
  • Depends on the kind of console. I added an example for Eclipse. It will, however, still fail if you convert the bytes back and forth using the default encoding as you did in your initial code. – BalusC Dec 22 '10 at 23:21
  • Do you know how I can do this in NetBeans? Thanks – Saher Ahwal Dec 22 '10 at 23:27
  • Sorry, I haven't touched NetBeans for years. Look around in its manual (F1?) using the keyword "encoding". To rule out one or the other, does it display properly when you do `System.out.println("somearabic")` directly? – BalusC Dec 22 '10 at 23:27
  • Yes, it does display properly when I do System.out.println("arabic stuff"). Could the problem be in Notepad or in the encoding of the txt file itself? – Saher Ahwal Dec 22 '10 at 23:33
  • Is the file itself saved as UTF-8? In Notepad you can specify that in a dropdown below the file name field in *Save As*. Otherwise please post a small snippet of Arabic text and the byte values of the byte array which you get when you read it by `InputStream`, then we can determine whether it has been decoded properly. – BalusC Dec 22 '10 at 23:36
  • Thanks. That was it. It was Notepad; I didn't notice the encoding option there! – Saher Ahwal Dec 22 '10 at 23:39
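
Since the cause turned out to be the encoding the file was saved in, a quick way to check this from Java is to decode the raw bytes strictly and see whether that fails. This is a minimal sketch with hypothetical names (`Utf8Check`, `isValidUtf8`); in practice the bytes would come from something like `Files.readAllBytes`. Note that it cannot detect a file in which the characters were already replaced by literal question marks on save, because such a file is valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original thread.
class Utf8Check {

    // Returns true only if the bytes form well-formed UTF-8.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Arabic letter MEEM encoded as UTF-8.
        byte[] utf8 = "\u0645".getBytes(StandardCharsets.UTF_8);
        // A lone 0xE3 byte, as a single-byte legacy code page might store a letter.
        byte[] legacy = { (byte) 0xE3 };

        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(legacy)); // false
    }
}
```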