-3

I'm using the following code to process a large text file line by line. The text is in Croatian rather than English, and many of the characters appear as � in the output file. How can I resolve this?

The file is saved as ANSI, but this does not seem to be an encoding name that InputStreamReader accepts. What encoding should I save the original file as?

try (BufferedWriter bw = new BufferedWriter(new FileWriter(FILENAME))) {

 String line;
 try {
  try (
   InputStream fis = new FileInputStream("C:\\Users\\marti\\Documents\\Software Projects\\Java Projects\\TwitterAutoBot\\src\\main\\resources\\EH.Txt");
   InputStreamReader isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
   BufferedReader br = new BufferedReader(isr);
  ) {
   while ((line = br.readLine()) != null) {
    // Deal with the line

    String content = line.substring(line.lastIndexOf("  ") + 1);
    System.out.println(content);

    bw.write("\n\n" + content);

   }
  }
 } catch (IOException e) {
  e.printStackTrace();
 }

 // bw.close();

} catch (IOException e) {

 e.printStackTrace();

}
Martin Erlic
  • What encoding is your input file using? – Greg Kopff Dec 18 '17 at 00:43
  • @GregKopff It's in ANSI. – Martin Erlic Dec 18 '17 at 00:58
  • @MartinErlic If it is `ANSI`, *why* did you specify **`UTF-8`** in your code? --- If it is [`ANSI`](https://en.wikipedia.org/wiki/ANSI_character_set), which flavor of [extended ANSI](https://en.wikipedia.org/wiki/Extended_ASCII) is it? – Andreas Dec 18 '17 at 01:02
  • Because I didn't check the character encoding of the file beforehand! – Martin Erlic Dec 18 '17 at 01:05
  • However, ANSI is not a recognized encoding name in InputStreamReader. Somebody suggested using `US-ASCII`, but this doesn't work either and produces the same weird characters. Neither does saving the file as UTF-8, because I lose the translations. – Martin Erlic Dec 18 '17 at 01:08
  • @MartinErlic What "translations" are you talking about? You shouldn't have any problems with UTF-8 for any European language. Wikipedia also claims that [Windows-1250](https://en.wikipedia.org/wiki/Windows-1250) is suitable for Croatian. – user882813 Dec 18 '17 at 01:30

2 Answers

0

I solved this by reading the file with the Cp1252 charset instead of UTF-8, because the file was saved as ANSI.
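For anyone landing here later, a minimal sketch of that change might look like the following. The class name `AnsiToUtf8`, the method `convert`, and the command-line arguments are made up for illustration, and `windows-1252` (the canonical name for Cp1252) is an assumption about what "ANSI" means on a given machine; the comments above suggest Windows-1250 as another candidate for Croatian text. The output is written as UTF-8 explicitly rather than through `FileWriter`.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AnsiToUtf8 {

    // Copies a text file line by line, decoding the input with `source`
    // and encoding the output as UTF-8. (Hypothetical helper, not from the question.)
    static void convert(String inputPath, String outputPath, Charset source) throws IOException {
        try (BufferedReader br = new BufferedReader(
                 new InputStreamReader(new FileInputStream(inputPath), source));
             BufferedWriter bw = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(outputPath), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);
                bw.newLine(); // platform line separator
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: AnsiToUtf8 <input> <output>");
            return;
        }
        // Assumption: the "ANSI" file is really windows-1252 (Cp1252).
        convert(args[0], args[1], Charset.forName("windows-1252"));
    }
}
```

If some letters still come out wrong, the charset passed to `convert` is the first thing to change; the rest of the code does not depend on it.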

Martin Erlic
-1

You need to use the InputStreamReader/OutputStreamWriter constructors that take a Charset. A constructor that takes no charset argument (FileWriter in your code, for example) uses the default charset for your platform, which evidently is not what you need.

If you're using Java 8 or above, you might use one of the convenience methods in Files:
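A rough sketch of that approach, assuming `Files.newBufferedReader` and `Files.newBufferedWriter` are the methods meant here. The class name `FilesCharsetExample`, the method `rewriteAsUtf8`, and the choice of `windows-1250` in `main` are illustrative assumptions, not something taken from the question:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class FilesCharsetExample {

    // Reads `input` using `inputCharset` and writes the same lines to `output` as UTF-8.
    // (Hypothetical helper for illustration.)
    static void rewriteAsUtf8(Path input, Path output, Charset inputCharset) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(input, inputCharset);
             BufferedWriter bw = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);
                bw.newLine(); // platform line separator
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: FilesCharsetExample <input> <output>");
            return;
        }
        // Assumption: the input is windows-1250; substitute the file's real encoding.
        rewriteAsUtf8(Paths.get(args[0]), Paths.get(args[1]), Charset.forName("windows-1250"));
    }
}
```

One difference from a plain `InputStreamReader`: the reader returned by `Files.newBufferedReader` throws an `IOException` (typically a `MalformedInputException`) on bytes that are invalid for the given charset instead of silently substituting �, so a wrong charset guess tends to fail loudly.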

You need to ensure that you're reading the input file with the correct charset, as well as writing a file in a charset that supports the characters you're trying to write. UTF-8 is a suitable output file format.

Greg Kopff