I have a program that reads a file containing Latin characters such as "\xed". These characters can appear anywhere in any line, so my program needs to parse them correctly. Is there a library that can do this?

chj

2 Answers

A simple approach I often use is an InputStreamReader with the "UTF-8" charset. For example:

        // Requires: import java.io.*;
        File fileDir = new File("c:/temp/sample.txt");

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        new FileInputStream(fileDir), "UTF-8"))) {

            String str;

            // Each line is decoded from UTF-8 bytes into a Java String
            while ((str = in.readLine()) != null) {
                System.out.println(str);
            }
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException, a subclass of IOException
            System.out.println(e.getMessage());
        }
Kenny Tai Huynh

If you mean that the text is in bytes and you have a byte with the hex value ED, then the interpretation of that byte depends on your code page.

Java represents all Strings internally as UTF-16. This means that a character set conversion is pretty much always applied when reading and writing files (UTF-16 is not a common file encoding).
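To make that conversion concrete, here is a minimal sketch that encodes the same one-character String into three encodings and reports the byte counts. The class and method names (`EncodedLengthDemo`, `encodedLengths`) are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodedLengthDemo {
    // Returns the number of bytes the text occupies in
    // ISO-8859-1, UTF-8 and UTF-16BE (no byte order mark), in that order.
    static int[] encodedLengths(String text) {
        return new int[] {
            text.getBytes(StandardCharsets.ISO_8859_1).length,
            text.getBytes(StandardCharsets.UTF_8).length,
            text.getBytes(StandardCharsets.UTF_16BE).length
        };
    }

    public static void main(String[] args) {
        // U+00ED is LATIN SMALL LETTER I WITH ACUTE
        System.out.println(Arrays.toString(encodedLengths("\u00ED")));
    }
}
```

The same character takes one byte in ISO-8859-1 and two bytes in UTF-8, which is why the reader has to be told which encoding the file uses.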

By default, Java will use the platform default character set. If this is not the correct one, you have to specify the Charset to use.

As an example of the problem, byte ED is:

  • ISO-8859-1: í (Unicode U+00ED), US Windows (Windows-1252 maps this byte the same way)
  • Windows-1251: н (Unicode U+043D), Russian
  • Code page 437: φ (Unicode U+03C6), US Windows command line (Win 7)
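The list above can be checked with a short sketch that decodes the single byte ED under each charset and prints the resulting code point. The class and method names are invented for this example, and it assumes the JVM ships the "windows-1251" and "IBM437" charsets (most full JDKs do, but only a handful of charsets are guaranteed), so each name is checked with `Charset.isSupported` first.

```java
import java.nio.charset.Charset;

class ByteEdDemo {
    // Decodes a single byte value using the named charset.
    static String decodeByte(int value, String charsetName) {
        byte[] bytes = { (byte) value };
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String[] names = { "ISO-8859-1", "windows-1251", "IBM437" };
        for (String name : names) {
            // Only a few charsets are guaranteed on every JVM
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this JVM");
                continue;
            }
            String s = decodeByte(0xED, name);
            // Print the code point rather than the character itself,
            // so the result does not depend on the console encoding
            System.out.printf("%s: U+%04X%n", name, (int) s.charAt(0));
        }
    }
}
```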

To control the code page, read the file like this:

File file = new File("C:\\path\\to\\file.txt");
try (BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(file), "ISO-8859-1"))) {
    String line;
    while ((line = in.readLine()) != null) {
        // process line here
    }
}

Or with the newer Path API:

Path path = Paths.get("C:\\path\\to\\file.txt");
try (BufferedReader in = Files.newBufferedReader(path, Charset.forName("ISO-8859-1"))) {
    String line;
    while ((line = in.readLine()) != null) {
        // process line here
    }
}
Andreas
  • I guess if the entire file is in ISO-8859 format your method will work perfectly! However, my file is a mixture of ISO-8859 and UTF-8 – chj Aug 25 '15 at 02:48
  • @chj Surely you're joking. UTF-8 uses bytes 80-FF and so does ISO-8859-1. How are you supposed to know if a byte in that range is one or the other? – Andreas Aug 25 '15 at 02:49
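
As the comment above says, there is no certain way to tell the two encodings apart. One heuristic that is sometimes used for mixed files is to attempt a strict UTF-8 decode of each line and fall back to ISO-8859-1 when the bytes are not valid UTF-8. The sketch below is not taken from either answer. It assumes the file has already been split into lines at the byte level, and the names `MixedEncodingDemo` and `decodeLine` are hypothetical. It can still guess wrong, because some ISO-8859-1 byte sequences happen to be valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class MixedEncodingDemo {
    // Heuristic: try strict UTF-8 first, fall back to ISO-8859-1.
    static String decodeLine(byte[] lineBytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of
                    // silently substituting U+FFFD for bad input
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(lineBytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: ISO-8859-1 maps every byte to a character
            return new String(lineBytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] utf8Line = { 'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9 }; // "café" in UTF-8
        byte[] latinLine = { 's', (byte) 0xED };                       // "sí" in ISO-8859-1
        System.out.println(decodeLine(utf8Line).length());  // 4 characters
        System.out.println(decodeLine(latinLine).length()); // 2 characters
    }
}
```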