I have a program that reads a file containing Latin words such as "\xed". These can appear anywhere within any line, so my program has to parse these characters. Is there a library that can do this?
-
What do you mean by in between? – LJNielsenDk Aug 25 '15 at 02:11
-
I didn't know `\xed` was a Latin *word*. What does it mean? – Andreas Aug 25 '15 at 02:18
-
@Andreas, I just found out that it should be parsed to \u00ed, which is "Latin Small Letter I with Acute" – chj Aug 25 '15 at 02:32
-
@LJNielsenDk, for example, I have K\xedng, which should be Kíng – chj Aug 25 '15 at 02:33
-
Are you saying it contains the *string* "K\xedng", or does it contain the *bytes* `4B ED 6E 67` ("Kíng")? – Andreas Aug 25 '15 at 02:54
-
@Andreas, nope. I guess it comes from how the original file was parsed. I worked around it by replacing "\x" with "\u00" and then applying StringEscapeUtils. – chj Aug 26 '15 at 02:24
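The workaround described in the last comment can also be sketched with the JDK alone. This is a minimal illustration, not the asker's actual code: the class name `HexEscapeDecoder` and the method `decodeHexEscapes` are hypothetical, and it assumes each `\xNN` escape denotes the code point U+00NN (the same assumption as the "\x" to "\u00" replacement). If the escapes were really UTF-8 bytes such as `\xc3\xad`, this would produce the wrong characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns literal "\xNN" escapes into the character U+00NN.
final class HexEscapeDecoder {
    // A literal backslash, 'x', then exactly two hex digits, e.g. \xed
    private static final Pattern HEX_ESCAPE =
            Pattern.compile("\\\\x([0-9A-Fa-f]{2})");

    static String decodeHexEscapes(String input) {
        Matcher m = HEX_ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Assumption: the two hex digits are a code point in U+0000..U+00FF.
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '\' and '$' in the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal "K\\xedng" is the text K\xedng from the comments.
        System.out.println(decodeHexEscapes("K\\xedng"));
    }
}
```

Text without escapes passes through unchanged, so the method can be applied to every line as it is read.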
2 Answers
0
A simple approach I often use is an InputStreamReader with the UTF-8 charset. For example:
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    File fileDir = new File("c:/temp/sample.txt");
    // try-with-resources closes the reader even if reading fails.
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(
                    new FileInputStream(fileDir), StandardCharsets.UTF_8))) {
        String str;
        while ((str = in.readLine()) != null) {
            System.out.println(str);
        }
    } catch (IOException e) {
        // Covers a missing file as well as read errors.
        System.out.println(e.getMessage());
    }

Kenny Tai Huynh
0
If you mean that the text is in bytes and you have a byte with the hex value `ED`, then the interpretation of that byte depends on your code page.
Java stores all `String`s internally in UTF-16. This means that a code page conversion is pretty much always applied when reading and writing files (UTF-16 is not a common file encoding).
By default, Java will use the platform default character set. If this is not the correct one, you have to specify the `Charset` to use.
As an example of the problem, byte `ED` is:
- ISO-8859-1: `í` (Unicode `00ED`) US Windows
- Windows-1251: `н` (Unicode `043D`) Russian
- Code page 437: `φ` (Unicode `03C6`) US Windows command-line (Win 7)
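To check the list above on your own machine, here is a small sketch that decodes the single byte `ED` under each of the three charsets and prints the resulting code point. The class name `ByteEdDemo` and its `decode` helper are made up for this illustration. Only ISO-8859-1 is guaranteed to exist on every Java runtime, so the other two names (`windows-1251`, `IBM437`) are checked with `Charset.isSupported` first.

```java
import java.nio.charset.Charset;

// Hypothetical demo: one byte, three code pages, three different characters.
final class ByteEdDemo {
    // Decodes a single byte using the named charset.
    static String decode(byte b, String charsetName) {
        return new String(new byte[] { b }, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte b = (byte) 0xED;
        String[] names = { "ISO-8859-1", "windows-1251", "IBM437" };
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                // Extended charsets may be missing on a trimmed-down runtime.
                System.out.println(name + " is not available on this runtime");
                continue;
            }
            String s = decode(b, name);
            // Print the code point rather than the glyph, so the result
            // does not depend on the console's own encoding.
            System.out.printf("%-12s -> U+%04X%n", name, (int) s.charAt(0));
        }
    }
}
```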
To control the code page, read the file like this:
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    File file = new File("C:\\path\\to\\file.txt");
    try (BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(file), "ISO-8859-1"))) {
        String line;
        while ((line = in.readLine()) != null) {
            // process line here
        }
    }
Or with the newer `Path` API:
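    import java.io.BufferedReader;
    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    Path path = Paths.get("C:\\path\\to\\file.txt");
    try (BufferedReader in = Files.newBufferedReader(path, Charset.forName("ISO-8859-1"))) {
        String line;
        while ((line = in.readLine()) != null) {
            // process line here
        }
    }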

Andreas
-
I guess if the entire file is in ISO-8859 format, your method will work perfectly! However, my file is a mixture of ISO-8859 and UTF-8 – chj Aug 25 '15 at 02:48
-
@chj Surely you're joking. UTF-8 uses bytes 80-FF and so does ISO-8859-1. How are you supposed to know if a byte in that range is one or the other? – Andreas Aug 25 '15 at 02:49
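As the comment above points out, there is no reliable way to tell the two encodings apart byte by byte. One common heuristic, shown here only as a sketch and not as something either poster proposed, is to try strict UTF-8 decoding on each chunk (for example, each line's raw bytes) and fall back to ISO-8859-1 when the bytes are not valid UTF-8. The class name `LenientDecoder` and the method `decodeUtf8OrLatin1` are hypothetical. The heuristic will misread any ISO-8859-1 text that happens to also form valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 first, ISO-8859-1 as the fallback guess.
final class LenientDecoder {
    static String decodeUtf8OrLatin1(byte[] bytes) {
        try {
            // REPORT makes the decoder throw on invalid input instead of
            // silently substituting the replacement character.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume ISO-8859-1, which maps every byte
            // to a character and therefore never fails.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = { 0x4B, (byte) 0xED, 0x6E, 0x67 };            // "Kíng" in ISO-8859-1
        byte[] utf8 = { 0x4B, (byte) 0xC3, (byte) 0xAD, 0x6E, 0x67 }; // "Kíng" in UTF-8
        System.out.println(decodeUtf8OrLatin1(latin1).equals(decodeUtf8OrLatin1(utf8)));
    }
}
```

Because a lone `ED` byte followed by an ASCII letter is not a valid UTF-8 sequence, the first array falls back to ISO-8859-1, and both inputs decode to the same string.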