Latin Character Inbetween String

Question

I have a program that reads a file containing Latin characters written as escapes such as "\xed". These characters can appear anywhere within a line, so my program needs to parse them. Is there a library that can do this?

@Andreas, I just found out that it should be parsed to \u00ed, which is "Latin Small Letter I with acute" — chj, Aug 25 '15 at 02:32
Are you saying it contains the *string* "K\xedng", or is it containing the *bytes* `4B ED 6E 67` ("Kíng")? — Andreas, Aug 25 '15 at 02:54
@Andreas, nope. I guess it comes from how the original file was parsed. I overcame it by replacing "\x" with "\u00" and then applying StringEscapeUtils. — chj, Aug 26 '15 at 02:24
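
The workaround in the last comment (turning the literal text \xHH into the character U+00HH) can be sketched with the standard regex classes alone. This is a minimal sketch rather than the asker's actual code: the class name HexEscapeDecoder and the method decode are made up for illustration, and it assumes every escape is exactly a backslash, an x and two hex digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HexEscapeDecoder {
    // Matches a literal backslash, 'x', and exactly two hex digits, e.g. \xed
    private static final Pattern HEX_ESCAPE = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");

    // Hypothetical helper: replaces each \xHH escape with the character U+00HH.
    static String decode(String input) {
        Matcher m = HEX_ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '\' and '$' in the replacement text
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal "K\\xedng" is the six-character text K\xedng
        System.out.println(decode("K\\xedng"));
    }
}
```

Mapping HH straight to U+00HH is the same as reading the value as ISO-8859-1, which matches the \u00ed result mentioned in the comments. Text without escapes passes through unchanged.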

score 0 · Answer 1 · answered Aug 25 '15 at 02:06

The simple approach I often use is an InputStreamReader with the "UTF8" encoding. For example:

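        // Needs: java.io.BufferedReader, File, FileInputStream,
        // InputStreamReader, IOException, UnsupportedEncodingException
        File fileDir = new File("c:/temp/sample.txt");

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        new FileInputStream(fileDir), "UTF8"))) {

            String str;

            // Read and print the file one line at a time
            while ((str = in.readLine()) != null) {
                System.out.println(str);
            }
        }
        catch (UnsupportedEncodingException e)
        {
            // The encoding name is not known to this JVM
            System.out.println(e.getMessage());
        }
        catch (IOException e)
        {
            // The file is missing or could not be read
            System.out.println(e.getMessage());
        }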

Andreas · Answer 2 · 2015-08-25T02:43:06.393

If you mean that the text is in bytes and you have a byte with the hex value ED, then the interpretation of that byte depends on your code page.

Java stores all Strings internally in UTF-16. This means that a code page conversion is pretty much always applied when reading and writing files (UTF-16 is not a common file encoding).

By default, Java will use the platform default character set. If this is not the correct one, you have to specify the Charset to use.

As an example of the problem, byte ED is:

ISO-8859-1: í (Unicode U+00ED), US Windows
Windows-1251: н (Unicode U+043D), Russian
Code page 437: φ (Unicode U+03C6), US Windows command line (Win 7)
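
To check this on your own machine, the following sketch decodes the single byte ED under each of the three code pages and prints the resulting code point in hex, so the output does not depend on the console encoding. The class name ByteEdDemo and the helper decodeByte are made up for illustration, and it assumes the JVM ships the windows-1251 and IBM437 charsets, which is why it checks Charset.isSupported first.

```java
import java.nio.charset.Charset;

public class ByteEdDemo {
    // Hypothetical helper: decodes one byte value (0-255) with the named charset.
    static String decodeByte(int value, String charsetName) {
        byte[] single = { (byte) value };
        return new String(single, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String[] names = { "ISO-8859-1", "windows-1251", "IBM437" };
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this JVM");
                continue;
            }
            String decoded = decodeByte(0xED, name);
            // Print the code point, not the glyph, to avoid console encoding issues
            System.out.printf("%s: U+%04X%n", name, (int) decoded.charAt(0));
        }
    }
}
```

Where all three charsets are available, the reported code points should match the list above.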

To control the code page, read the file like this:

File file = new File("C:\\path\\to\\file.txt");
try (BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(file), "ISO-8859-1"))) {
    String line;
    while ((line = in.readLine()) != null) {
        // process line here
    }
}

Or with the newer Path API:

Path path = Paths.get("C:\\path\\to\\file.txt");
try (BufferedReader in = Files.newBufferedReader(path, Charset.forName("ISO-8859-1"))) {
    String line;
    while ((line = in.readLine()) != null) {
        // process line here
    }
}

I guess if the entire file is in ISO-8859 format your method will work perfectly! However, my file is a mixture of ISO-8859 and UTF-8. — chj, Aug 25 '15 at 02:48
@chj Surely you're joking. UTF-8 uses bytes 80-FF and so does ISO-8859-1. How are you supposed to know if a byte in that range is one or the other? — Andreas, Aug 25 '15 at 02:49
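
As the comment above points out, a byte in the 80-FF range cannot be classified with certainty. One heuristic that is sometimes used anyway is to try a strict UTF-8 decode first and fall back to ISO-8859-1 only when the bytes are not valid UTF-8. The sketch below illustrates that idea and is not something proposed in this thread: the class name LenientDecoder and the method decodeUtf8OrLatin1 are hypothetical, and it assumes you already hold the raw bytes of one line or record.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LenientDecoder {
    // Hypothetical helper: strict UTF-8 first, ISO-8859-1 as the fallback.
    static String decodeUtf8OrLatin1(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: every byte maps to a character in ISO-8859-1
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = { 0x4B, (byte) 0xED, 0x6E, 0x67 };            // "Kíng" in ISO-8859-1
        byte[] utf8 = { 0x4B, (byte) 0xC3, (byte) 0xAD, 0x6E, 0x67 }; // "Kíng" in UTF-8
        System.out.println(decodeUtf8OrLatin1(latin1).equals(decodeUtf8OrLatin1(utf8)));
    }
}
```

The guess can be wrong: ISO-8859-1 text that happens to form a valid UTF-8 sequence (for example the bytes C3 AD) is decoded as UTF-8. A record that mixes both encodings also falls back to ISO-8859-1 as a whole.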
