
I have a Windows-1252 word document that I want to convert to UTF-8. I need to do this to correctly convert the doc file to a pdf. This is how I currently do it:

 Path source = Paths.get("source.doc");
 Path temp = Paths.get("temp.doc");
 try (BufferedReader sourceReader = new BufferedReader(new InputStreamReader(new FileInputStream(source.toFile()), "windows-1252"));
      BufferedWriter tempWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(temp.toFile()), "UTF-8"))) {
        String line;
        while ((line = sourceReader.readLine()) != null) {
           tempWriter.write(line);
           tempWriter.newLine(); // readLine() strips the line terminator, so write it back
        }
  }

However, when I open the converted file (temp.doc) in Word, some characters don't display correctly; Ü becomes ü, for example.

How can I solve this? When I create a new BufferedReader (with UTF-8 encoding) and I read temp, the characters are shown correctly in the console of my IDE.

bortdc
  • Side comment: using `Files.newBufferedReader` and `Files.newBufferedWriter` would make your code a lot simpler :) – Jon Skeet May 21 '14 at 10:11
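For reference, the `Files.newBufferedReader`/`Files.newBufferedWriter` version of the same re-encoding loop might look like the sketch below. (This is only a sketch, and it only makes sense if the source really is a plain-text file in windows-1252; the `Transcode` class and `transcode` method names are illustrative.)

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Transcode {
    // Re-encode a plain-text file from windows-1252 to UTF-8, line by line.
    static void transcode(Path source, Path target) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source, Charset.forName("windows-1252"));
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine(); // readLine() strips the terminator, so write it back
            }
        }
    }
}
```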

1 Answer


I have a Windows-1252 word document

That's not a text file. Word documents are basically binary data - open it up with a plain text editor and you'll see all kinds of gibberish. You may see some text in there as well, but basically it's not a plain text file, which is how you're trying to read it.

It's not even clear to me what a "Windows-1252 word document" means... Word will use whatever encoding it wants internally, and I'm not sure there's any control over that. I would expect any decent "doc to PDF" converter to handle any valid Word document.
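A quick way to check what you actually have is to look at the file's first bytes: a legacy binary .doc is an OLE2 compound file and starts with the magic bytes `D0 CF 11 E0 A1 B1 1A E1`, while a .docx is a zip container and starts with `PK`. A small sketch (the `FileKind` class and `kindOf` helper are illustrative names, not part of any library):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class FileKind {
    // Magic number of an OLE2 compound file (legacy binary .doc, .xls, .ppt).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    static String kindOf(Path p) throws IOException {
        byte[] head = new byte[8];
        int n;
        try (InputStream in = Files.newInputStream(p)) {
            n = in.readNBytes(head, 0, 8); // read up to the first 8 bytes
        }
        if (n == 8 && Arrays.equals(head, OLE2)) {
            return "binary .doc (OLE2 container)";
        }
        if (n >= 2 && head[0] == 'P' && head[1] == 'K') {
            return "zip container (e.g. .docx)";
        }
        return "no known Word signature - possibly plain text";
    }
}
```

If this reports an OLE2 container, reading the file with a `Reader` and a charset cannot work, whatever encoding you pick.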

When I create a new BufferedReader (with UTF-8 encoding) and I read temp, the characters are shown correctly in the console of my IDE.

If that's the case, that suggests it is a plain text file to start with, not a Word document. You need to be very clear in your own mind exactly what you've got - a Word document, or a plain text file. They're not the same thing, and shouldn't be treated the same way.

Jon Skeet
  • I use JODConverter to convert to PDF. When I try to convert my source document directly, the PDF contains what I see when I open the word document with a text editor: gibberish. However, when I change the charset of the word document first, it converts more or less correctly (with the exception of certain characters like mentioned before). It's a word document. When I use my BufferedReader to read the converted file again, it prints gibberish, but with correct characters (like the Ü). – bortdc May 21 '14 at 09:52
  • @bortdc: It's still very unclear what you've *really* started with - a Word document or plain text. Forget about the PDF part to start with - focus on what your source document is. If it's genuinely a Word document (so if you open it up in a plain text editor it has gibberish) then you need to abandon your approach of reading it with `BufferedReader` entirely... and maybe just try a different converter. It's not clear what you mean by "when I change the charset of the word document first" by the way. – Jon Skeet May 21 '14 at 10:10