
I thought this was only an issue with Python 2, but I have now run into a similar issue with Java (Windows 10, JDK 8).

My searches have led to little resolution so far.

I read this value from the stdin input stream: Viļāni. When I print it to the console I get this: Vi????ni.

Relevant code snippets are as follows:

    BufferedReader in = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));

    ArrayList<String> corpus = new ArrayList<String>();
    String inputString = null;
    while ((inputString = in.readLine()) != null) {
        corpus.add(inputString);
    }
    String[] allCorpus = new String[corpus.size()];
    allCorpus = corpus.toArray(allCorpus);
    for (String line : allCorpus) {
        System.out.println(line);
    }

To expand further on my problem:

I read a file containing the following two lines:

    を
    Sōten_Kōro

When I read this from disk and output it to a second file, I get the following output:

    を
    S�ten_K�ro

When I read the file from stdin using `cat testinput.txt | java UTF8Tester`, I get the following output:

    ???
    S??ten_K??ro

Both are obviously wrong. I need to be able to print the correct characters to the console and to a file. My sample code is as follows:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;

public class UTF8Tester {

    public static void main(String args[]) throws Exception {
        BufferedReader stdinReader = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String[] stdinData = readLines(stdinReader);
        printToFile(stdinData, "stdin_out.txt");

        BufferedReader fileReader = new BufferedReader(new FileReader("testinput.txt"));
        String[] fileData = readLines(fileReader);
        printToFile(fileData, "file_out.txt");

    }

    private static void printToFile(String[] data, String fileName)
            throws FileNotFoundException, UnsupportedEncodingException {
        PrintWriter writer = new PrintWriter(fileName, "UTF-8");
        for (String line : data) {
            writer.println(line);
        }
        writer.close();
    }

    private static String[] readLines(BufferedReader reader) throws IOException {
        ArrayList<String> corpus = new ArrayList<String>();
        String inputString = null;

        while ((inputString = reader.readLine()) != null) {
            corpus.add(inputString);
        }
        String[] allCorpus = new String[corpus.size()];
        return corpus.toArray(allCorpus);
    }

}

I'm really stuck here and any help would be much appreciated! Thanks in advance. Paul

    Cannot reproduce when running in Eclipse, Windows 7. Is the console application you're using capable of displaying UTF-8 characters? – leftbit Jan 16 '19 at 09:02
  • I am running this on Windows in VS Code terminal with the following command: `cat input.txt | java app` – Paul Jan 16 '19 at 09:11

1 Answer

  • System.in/out will use the default Windows character set.
  • Java String will use Unicode internally.
  • FileReader/FileWriter are old utility classes that always use the default character set, so they are suitable only for non-portable, local files (you can check that default with the sketch below).
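
As a quick check (a minimal sketch with a hypothetical class name, not part of the original answer), you can print the default charset your JVM actually picks up; on Windows this is typically Windows-1252 or a DOS code page such as Cp850 rather than UTF-8:

    import java.nio.charset.Charset;

    public class CharsetCheck {
        public static void main(String[] args) {
            // The charset that System.in/out and FileReader/FileWriter fall back to.
            System.out.println("Default charset:  " + Charset.defaultCharset());
            // The legacy system property behind that default.
            System.out.println("file.encoding:    " + System.getProperty("file.encoding"));
            // When stdin/stdout are piped (as with `cat file | java ...`), no console is attached.
            System.out.println("Console attached: " + (System.console() != null));
        }
    }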

The error you saw happens because a special character is encoded as a two-byte UTF-8 sequence, but each of those bytes is then interpreted in the default single-byte encoding; where the resulting value cannot be represented, each byte is substituted with a ?, hence two ? per character.
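
To make that concrete, here is a small illustrative sketch (the class name and the sample characters are mine, not from the question): it prints the two UTF-8 bytes of ļ, shows what happens when those bytes are decoded with a single-byte Windows charset instead, and shows the ? replacement used when a character has no mapping in the target charset:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class MojibakeDemo {
        public static void main(String[] args) {
            Charset cp1252 = Charset.forName("windows-1252"); // a typical Windows default

            // "ļ" (U+013C) is one character but two bytes in UTF-8: 0xC4 0xBC.
            byte[] utf8 = "ļ".getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.toString(utf8));       // [-60, -68]

            // Decoding those two bytes with the wrong single-byte charset
            // produces two wrong characters instead of one correct one.
            System.out.println(new String(utf8, cp1252));    // Ä¼

            // Encoding a character the target charset cannot represent at all
            // substitutes '?' (0x3F) for it.
            byte[] unmappable = "を".getBytes(cp1252);
            System.out.println(Arrays.toString(unmappable)); // [63], i.e. "?"
        }
    }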

  • The character must be representable in the default charset for it to arrive correctly via System.in.
  • The String is then decoded from the default charset.
  • Writing it to a file as UTF-8 requires specifying UTF-8 explicitly.

Hence:

    BufferedReader stdinReader = new BufferedReader(new InputStreamReader(System.in));
    String[] stdinData = readLines(stdinReader);
    printToFile(stdinData, "stdin_out.txt");

    // Reading a UTF-8 file: Files.readAllLines defaults to UTF-8.
    Path utf8Path = Paths.get("testinput-utf8.txt");
    List<String> lines = Files.readAllLines(utf8Path);

    // Reading a Windows-Latin-1 file: pass the charset explicitly.
    Path latin1Path = Paths.get("testinput-winlatin1.txt");
    List<String> latin1Lines = Files.readAllLines(latin1Path, Charset.forName("Windows-1252"));

    // Writing as UTF-8: note the argument order (path first, then the lines).
    Files.write(Paths.get("file_out.txt"), lines, StandardCharsets.UTF_8);

To check whether your current computer system handles Japanese:

System.out.println("Hiragana letter Wo '\u3092'."); // Either を or ?.

If you see ?, the conversion to the default system encoding could not deliver the character. を is U+3092, which can be written in pure ASCII source as the escape \u3092.

To create a UTF-8 text file under Windows:

    Files.write(Paths.get("out-utf8.txt"),
        "\uFEFFHiragana letter Wo '\u3092'.".getBytes(StandardCharsets.UTF_8));

Here I use an ugly (generally unneeded) BOM marker, the char \uFEFF (a zero-width no-break space), which lets Windows Notepad recognize the text as UTF-8.

Joop Eggen
  • Thanks Joop. So this works if I want to read the data from the file using Files.readAllLines and then output it to another file. My biggest issue, however, is being able to read it from stdin and then output it to the console, which I continue to struggle with. – Paul Jan 16 '19 at 13:32
  • With System.in and out one has to comply with the system's encoding (in general). If under Windows the encoding is Western Latin-1 (`Windows-1252`), one is out of luck for Greek, Cyrillic and Asian scripts. There exist 3rd-party console substitutions (JConsole?) one should be able to use. I'll add some code too. – Joop Eggen Jan 16 '19 at 14:01
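
(As an illustrative sketch of what such code could look like, with a hypothetical class name and under the assumption that the console itself is switched to UTF-8, e.g. with `chcp 65001` on Windows: wrap System.out in a PrintStream with an explicit charset. Whether the glyphs actually render still depends on the console and its font.)

    import java.io.PrintStream;
    import java.io.UnsupportedEncodingException;

    public class Utf8Console {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // Replace the default-encoded System.out with a UTF-8 one.
            // This only helps if the console really is in UTF-8 mode (e.g. `chcp 65001`).
            System.setOut(new PrintStream(System.out, true, "UTF-8"));
            System.out.println("Viļāni を Sōten_Kōro");
        }
    }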
  • I still have not been able to get this working on Windows, so I have resorted to running this on Linux, where these issues seem to be much less of a problem. Thanks for all the comments. – Paul Jan 17 '19 at 06:14
  • Yes, Linux with its ubiquitous UTF-8 is definitely more versatile; I have it at home for that purpose too. I am "waiting" for MS Windows to take the leap to Unicode only; who knows, UTF-16 or UTF-32. – Joop Eggen Jan 17 '19 at 08:16