Character encoding of parsed strings is wrong only after building jar

Question

I am writing a program that generates PDF files of printable exams. I have all the exam questions stored in a JSON file. The catch is that the exam is in Czech, so there are many special characters (specifically ěščřžýáíé). When I run the program in Idea, it works perfectly - the output is exactly as it is supposed to be.

But when I build the executable jar, the generated files contain chunks of incorrectly encoded text, specifically anything that went through the JSON parser. Everything hard-coded, such as headers, is encoded properly, so the mistake must be in the parsing step.

The JSON input file is encoded in UTF-8.

I use these two methods to parse the JSON file.

    private static Category[] parseJSON(){
        JSONParser jsonParser = new JSONParser();
        Category[] categories = new Category[0];

        try (FileReader reader = new FileReader("otazky.json")){
            // Read JSON file
            Object obj = jsonParser.parse(reader);

            JSONArray categoryJSONList = (JSONArray) obj;
            java.util.List<JSONObject> categoryList = new ArrayList<>(categoryJSONList);
            categories = new Category[categoryJSONList.size()];

            int i = 0;
            for (JSONObject category : categoryList) {
                categories[i] = parseCategoryObject(category);
                i++;
            }
        } catch (ParseException | IOException e) {
            e.printStackTrace();
        }
        return categories;
    }

    private static Category parseCategoryObject(JSONObject category) {
        String categoryName = (String) category.get("name");

        int generateCount = (int) (long) category.get("generateCount");

        JSONArray questionsJSONArray = (JSONArray) category.get("questions");

        java.util.List<JSONObject> questionJSONList = new ArrayList<>(questionsJSONArray);
        Question[] questions = new Question[questionJSONList.size()];
        int j = 0;

        for (JSONObject question : questionJSONList) {
            JSONArray answers = (JSONArray) question.get("answers");
            String s = (String) question.get("question");
            String[] a = new String[answers.size()];

            for (int i = 0; i < answers.size(); i++) {
                a[i] = answers.get(i).toString();
            }

            int c = (int) (long) question.get("correct");
            Question q = new Question(s, a, c);
            questions[j] = q;
            j++;
        }

        return new Category(categoryName, questions, generateCount);
    }

The output looks like this:

...
PrĂˇvnĂ norma:
a) je obecnÄ› zĂˇvaznĂ© pravidlo chovĂˇnĂ, kterĂ© nemusĂ mĂt urÄŤitou formu,
b) nemĹŻĹľe bĂ˝t souÄŤĂˇstĂ prĂˇvnĂho pĹ™edpisu,
...

While I would need it to look like this:

...
Právní norma:
a) je obecně závazné pravidlo chování, které nemusí mít určitou formu,
b) nemůže být součástí právního předpisu,
...

Did you try explicitly opening the file using the UTF-8 encoding? Looking at the [docs](https://docs.oracle.com/javase/8/docs/api/?java/io/FileReader.html), FileReader will open the file with the default character encoding, which is not UTF-8 on Windows. I suggest using `InputStreamReader` and `FileInputStream` instead of `FileReader` – Benjamin Urquhart, Apr 19 '19 at 15:49
@BenjaminUrquhart Thank you very much! I find using the `InputStreamReader` + `FileInputStream` combo confusing, but I ended up trying `Files.readAllLines()`, where you can specify the encoding, and it worked! How do I close this question now? :P – vitr, Apr 19 '19 at 16:23

score 1 · Answer 1 · answered Apr 19 '19 at 17:29

Benjamin Urquhart suggested that I try using InputStreamReader and FileInputStream instead of FileReader to read the file, because with FileReader you cannot specify the encoding (the system default is used). I find those two classes hard to use, but I found an alternative, Files.readAllLines, which lets you specify the encoding and is fairly easy to use, and it worked.
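For reference, here is a minimal sketch of the options mentioned above, using only JDK classes. The class and method names (`Utf8ReadSketch`, `readAllUtf8`, `openUtf8`, `openUtf8Stream`) are made up for illustration, and the temporary file merely stands in for `otazky.json`. Either `Reader` should be usable in place of the `FileReader` passed to `jsonParser.parse(reader)` in the question.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8ReadSketch {

    // Option 1: read the whole file as UTF-8 text via Files.readAllLines.
    static String readAllUtf8(Path path) throws IOException {
        return String.join("\n", Files.readAllLines(path, StandardCharsets.UTF_8));
    }

    // Option 2: a Reader with an explicit charset instead of new FileReader(...).
    static Reader openUtf8(Path path) throws IOException {
        return Files.newBufferedReader(path, StandardCharsets.UTF_8);
    }

    // Option 3: the InputStreamReader + FileInputStream combination from the comments.
    static Reader openUtf8Stream(Path path) throws IOException {
        return new InputStreamReader(new FileInputStream(path.toFile()), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for otazky.json; the Unicode escapes spell "Právní norma"
        // so the sketch does not depend on the encoding of this source file.
        Path tmp = Files.createTempFile("otazky", ".json");
        try {
            String json = "[{\"name\": \"Pr\u00e1vn\u00ed norma\"}]";
            Files.write(tmp, json.getBytes(StandardCharsets.UTF_8));
            System.out.println("round trip ok: " + readAllUtf8(tmp).equals(json));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

All three variants name the charset explicitly, so the result no longer depends on how the JVM was started.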

vitr

Yes, the user's system's current default is never what you want to use, except when it is exactly what you need to use. IETF standards say JSON should be in UTF-8. There is hardly ever a reason for it not to be. – Tom Blodget Apr 19 '19 at 20:34
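
To illustrate why the default matters: the garbled output in the question is consistent with UTF-8 bytes being decoded as windows-1250. The sketch below reproduces that effect. It assumes windows-1250 was the asker's platform default (the question does not say), and the names `MojibakeSketch` and `misdecode` are hypothetical. IDEs such as IntelliJ IDEA commonly pass `-Dfile.encoding=UTF-8` when launching a program, which would likely explain why the problem only showed up when running the jar directly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeSketch {

    // Encode text as UTF-8, then decode the bytes with a different charset,
    // which is what happens when a UTF-8 file is read with the wrong default.
    static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        // Assumption: windows-1250 stands in for the asker's default charset.
        Charset cp1250 = Charset.forName("windows-1250");

        // The charset this JVM would use for new FileReader(...) without an explicit charset.
        System.out.println("default charset: " + Charset.defaultCharset());

        // "obecn\u011b" is "obecně"; its last letter is two bytes in UTF-8 (0xC4 0x9B),
        // and each byte becomes a separate character when decoded as windows-1250.
        String garbled = misdecode("obecn\u011b", cp1250);
        System.out.println("length before: 6, after: " + garbled.length());
    }
}
```

Running the jar with `java -Dfile.encoding=UTF-8 -jar ...` would probably hide the symptom as well, but naming the charset in the code is the more robust fix.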
