I'm having trouble reading a UTF-8 encoded text file in Hebrew. All Hebrew characters are read correctly except for two letters: 'מ' and 'א'.

Here is how I read it:

    FileInputStream fstream = new FileInputStream(SCHOOLS_LIST_PATH);
    DataInputStream in = new DataInputStream(fstream);
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String strLine;

    // Read File Line By Line
    while ((strLine = br.readLine()) != null) {
        // Skip lines that contain the "zevel" marker
        if (strLine.contains("zevel")) {
            continue;
        }
        schools.add(getSchoolFromLine(strLine));
    }

Any idea?

Thanks, Tomer

Peter Lawrey
tomericco
  • What are you reading instead of 'מ' and 'א'? – jarnbjo May 09 '11 at 11:39
  • A square and a question mark for each one of these two letters. Something like - "?ם" – tomericco May 09 '11 at 15:07
  • Please don't use DataInputStream to read text. Unfortunately examples like this get copied again and again, so can you remove it from your example? http://vanillajava.blogspot.co.uk/2012/08/java-memes-which-refuse-to-die.html – Peter Lawrey Jan 31 '13 at 00:10

1 Answer

You're using InputStreamReader without specifying the encoding, so it's using the default for your platform - which may well not be UTF-8.

Try:

    new InputStreamReader(in, "UTF-8")

Note that it's not obvious why you're using DataInputStream here... just create an InputStreamReader around the FileInputStream.
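As a rough sketch of that suggestion (not the asker's actual code), the reader can be built directly on the FileInputStream with the charset named explicitly. The class name `Utf8LineReader`, the `readLines` helper, and the temporary demo file are all made up for illustration; the "zevel" filtering and `getSchoolFromLine` from the question are left out.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
public class Utf8LineReader {

    // Reads all lines of the file, decoding the bytes as UTF-8
    // regardless of the platform default charset.
    static List<String> readLines(String path) throws IOException {
        List<String> lines = new ArrayList<>();
        // No DataInputStream: the InputStreamReader wraps the FileInputStream directly.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // This is the charset that InputStreamReader falls back to
        // when no encoding is passed to its constructor.
        System.out.println("Platform default charset: " + Charset.defaultCharset());

        // Assumed demo input: a temp file holding the two letters from the
        // question, written as Unicode escapes (U+05DE and U+05D0).
        Path tmp = Files.createTempFile("schools", ".txt");
        Files.write(tmp, "\u05DE\u05D0\n".getBytes(StandardCharsets.UTF_8));

        // Print code points rather than glyphs, so the check does not
        // depend on how the console renders Hebrew.
        for (String line : readLines(tmp.toString())) {
            line.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
        }
        Files.delete(tmp);
    }
}
```

Printing code points instead of the characters themselves helps separate a decoding problem from a display problem, since the console or UI font can also mangle text that was read correctly.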

Jon Skeet
  • Is it really likely that he is using a default encoding which is compatible with UTF-8 except for the characters 'מ' and 'א'? – jarnbjo May 09 '11 at 11:38
  • @jarnbjo: I don't know, but it's the most obvious starting point, and it's definitely the first step I'd take. – Jon Skeet May 09 '11 at 12:06
  • Why is that obvious? If he is not using UTF-8 as the default encoding, reading a UTF-8 encoded file with Hebrew characters would produce garbage and not just a few misinterpreted characters. – jarnbjo May 09 '11 at 12:22
  • @jarnbjo: Not specifying an encoding when he expects a particular one is an obvious bad thing to do, is what I meant. The code would definitely be improved by specifying the charset, and it *may* fix the problem. – Jon Skeet May 09 '11 at 12:24
  • Hi, if I add "UTF-8", all characters are read as weird signs. I'll try to use FileInputStream only. – tomericco May 09 '11 at 15:11
  • @tomericco: If specifying `UTF-8` makes things *worse*, then it's not a UTF-8 file to start with. What's the source of the file, and what made you think it was UTF-8? – Jon Skeet May 09 '11 at 15:13
  • The file is a regular text file that was saved in UTF-8 encoding via Windows Notepad. I tried omitting the DataInputStream, but things got worse. – tomericco May 10 '11 at 10:56
  • @tomericco: It shouldn't have changed anything. It sounds like your way of diagnosing what's going on may be problematic... and if it's *definitely* UTF-8, then that's what you should specify. If you *load* the file in another text editor (not Notepad) specifying UTF-8, does that work? – Jon Skeet May 10 '11 at 11:03
  • What do you mean by "load the file in another text editor.."? I'm loading the file into my app using Java. I'm saving the file to the hard disk via Notepad. – tomericco May 10 '11 at 15:52
  • @tomericco: Um, exactly what I said... use another text editor (e.g. Notepad++) and load the file to check that it looks correct there. – Jon Skeet May 10 '11 at 15:53
  • Yes, it looks fine in other text editors. I have to mention that if I run the app from Netbeans, it runs OK. But if I execute it from the jar that Netbeans produces, the problem occurs. – tomericco May 10 '11 at 16:27
  • @tomericco: Oh, that changes things completely. In that case it's unlikely to be the reading part - it's much more likely to be the display side. How are you displaying the data? On a console? – Jon Skeet May 10 '11 at 16:29
  • No - on a JLabel. I'm reading the text into a String and setting it like this: label.setText(readLine); – tomericco May 10 '11 at 18:57
  • Make sure to use an editor (such as Notepad++) and convert the text to UTF-8; the Windows default is to save as ANSI. – Dudi Jun 04 '13 at 21:39