BufferedReader encoding charset for Russian characters

Question

We are currently seeing Russian characters converted to junk data, which shows up as rectangles in Notepad. Below is the code we are using; it runs on a Linux server with Java 1.8.

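// FileReader decodes the file with the platform default charset
BufferedReader buff=new BufferedReader(new FileReader(new File("text.txt")));
String line;
StringBuffer result=new StringBuffer();
while((line=buff.readLine())!=null)
{
   result.append(line).append('\n');
}
// getBytes() without an argument also uses the platform default charset
return result.toString().getBytes();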

Earlier, the same code used to work in an AIX environment with Java 1.6.

Can anyone please give me a hint about what might be going wrong? This seems to be entirely environmental, since no code changes have been made.

JavaSheriff · Answer 1 · 2018-08-15T18:03:21.337

0

Try this

    BufferedReader buff= new BufferedReader(
       new InputStreamReader(
                  new FileInputStream(fileDir), "UTF-8"));

Edit

Make sure the file is saved as UTF-8.
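One way to check whether the file really is UTF-8 is to decode its bytes strictly and see whether the decoder reports an error. The following is a minimal sketch, not part of the original answer: the class name `Utf8Check` and the method `isValidUtf8` are made up for illustration, and `text.txt` is simply the file name from the question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only.
final class Utf8Check {

    // Returns true only if the bytes decode as UTF-8 without a malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Defaults to the file name used in the question; pass another path as args[0].
        String path = args.length > 0 ? args[0] : "text.txt";
        byte[] data = Files.readAllBytes(Paths.get(path));
        System.out.println(path + (isValidUtf8(data) ? " decodes as UTF-8" : " is not valid UTF-8"));
    }
}
```

A pure-ASCII file also passes this check, so a positive result only rules out a single-byte Cyrillic encoding; it does not prove the text was meant as UTF-8.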

edited Aug 15 '18 at 18:03

answered Aug 15 '18 at 18:02


We tried that, but the issue remains. It looks like the source file is in ASCII encoding. – Mayank Vaid Aug 15 '18 at 18:03
When saving in UTF-8, some of the characters are removed, which strips the macros from the file ... So the file needs to be saved in ASCII only. – Mayank Vaid Aug 15 '18 at 18:08
We tried "Cp1251". It worked, but it doesn't give an option to save in ASCII encoding. – Mayank Vaid Aug 15 '18 at 18:10
There is no such thing as ASCII when speaking of Russian characters. Try these one-by-one: UTF-8, ISO-8859-5, Windows-1251, CP866 – Lorinczy Zsigmond Aug 15 '18 at 18:43
No one ever uses ISO-8859-5, forget about it. But do try Windows-1251 a.k.a. cp1251 and KOI8-R. – rustyx Aug 15 '18 at 18:55
Also if you call String.getBytes (or String-constructor-from-byte[]) without specifying an encoding, some unpredictable platform-specific default-value will be used as encoding (which very well might be 7-bit ASCII). – Lorinczy Zsigmond Aug 15 '18 at 19:33
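The comments above suggest two checks: finding out which default charset the JVM picked up on the new server, and decoding the same bytes with each candidate encoding. A rough sketch of both follows, assuming the file fits in memory. The class name `CharsetProbe` and its `decode` helper are hypothetical, and `IBM866` is used as the Java name for CP866.

```java
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only.
final class CharsetProbe {

    // Candidate encodings named in the comments above.
    static final String[] CANDIDATES = {"UTF-8", "ISO-8859-5", "windows-1251", "IBM866", "KOI8-R"};

    // Decodes the given bytes with the named charset.
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) throws Exception {
        // What FileReader and the no-argument getBytes() use on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // Defaults to the file name used in the question; pass another path as args[0].
        byte[] data = Files.readAllBytes(Paths.get(args.length > 0 ? args[0] : "text.txt"));
        for (String name : CANDIDATES) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not supported by this JVM");
                continue;
            }
            String text = decode(data, name);
            // Print only the start of the text so the candidates are easy to compare.
            System.out.println(name + ": " + text.substring(0, Math.min(80, text.length())));
        }
    }
}
```

The console has its own encoding, so the printed samples may look wrong even when the decoding is right. Writing them to a UTF-8 file and opening that in an editor is a safer way to compare.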

score 0 · Accepted Answer · answered Aug 16 '18 at 04:51

Your code seems to be reading the whole file into a byte-array. That can be done this way:

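static byte [] GetFileBytes (String filename) 
    throws java.io.FileNotFoundException,
           java.io.IOException {
    java.io.File f= new java.io.File (filename);
    long fsize = f.length ();
    byte b [] = new byte [(int)fsize];
    int rsize= 0;
    /* try-with-resources closes the stream even if read() throws */
    try (java.io.FileInputStream fi= new java.io.FileInputStream (f)) {
        /* read() may return fewer bytes than requested, so loop until EOF */
        while (rsize<b.length) {
            int n= fi.read (b, rsize, b.length-rsize);
            if (n<0) break;
            rsize+= n;
        }
    }
    /* the file shrank while being read: return only the bytes actually read */
    if (rsize!=b.length) {
        byte [] btmp= new byte [rsize];
        System.arraycopy (b, 0, btmp, 0, rsize);
        b= btmp;
    }
    return b;
}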

Or, within your code, you can pick an encoding and use it for both conversions (bytes to characters when reading, characters back to bytes at the end):

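static byte [] GetFileByteArray (String filename)
    throws Exception {
    String cset= "ISO-8859-1"; /* one-byte encoding that maps all 256 byte values */
    StringBuffer result= new StringBuffer ();
    /* try-with-resources closes the reader and the underlying stream */
    try (java.io.BufferedReader buff=
        new java.io.BufferedReader
            (new java.io.InputStreamReader
                (new java.io.FileInputStream(filename), cset))) {
        String line;
        while((line=buff.readLine())!=null)
        {
            /* readLine() strips the line terminator; append '\n' in its place */
            result.append(line).append('\n');
        }
    }
    /* encode with the same charset that was used for decoding */
    return result.toString().getBytes(cset);
}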
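To see why the choice of charset matters for this byte-to-String-to-byte round trip, the sketch below pushes all 256 byte values through it with two different charsets. It is an illustration only; the class name `RoundTripDemo` and the method `roundTrip` are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class RoundTripDemo {

    // Decodes the bytes to a String and encodes them back with the same charset.
    static byte[] roundTrip(byte[] data, Charset cs) {
        return new String(data, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // Every possible byte value, 0x00 through 0xFF.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        System.out.println("ISO-8859-1 lossless: "
                + Arrays.equals(all, roundTrip(all, StandardCharsets.ISO_8859_1)));
        System.out.println("US-ASCII lossless:   "
                + Arrays.equals(all, roundTrip(all, StandardCharsets.US_ASCII)));
    }
}
```

ISO-8859-1 should report `true` and US-ASCII `false`, because bytes above 0x7F cannot be represented in ASCII and come back as `?`. Note that the `readLine()` loop in the answer still changes the data slightly: every line terminator becomes `\n`, and a final `\n` is added if the file did not end with one.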
