
When I'm converting a docx document to PDF, my national characters turn into "#" marks.
Is there any way to set a font encoding for PDF documents?

I used XDocReport in the past and it can handle that, but I had problems with images, headers and footers.

Docx4j handles images, headers and footers, but not fonts: after conversion the fonts have ANSI encoding, while I'd like windows-1250. Is there an option to set this?

robson
  • this question is related to http://stackoverflow.com/questions/29607496/how-to-handle-special-characters-when-converting-from-html-to-docx – Cláudio Apr 13 '15 at 14:14

2 Answers

5

My problem was missing proper TrueType fonts on the Linux server. The default fonts were used instead (without my code pages).

I solved the problem by installing the default MS Windows fonts via ttf-mscorefonts-installer.

On Debian:

apt-get install ttf-mscorefonts-installer
robson
  • 1,623
  • 8
  • 28
  • 43
  • For cases where you can't do that, docx4j has a concept of a font mapper which allows you to map a document font to an available physical font (see the sketch after these comments). – JasonPlutext Sep 11 '12 at 21:29
  • Even after installing all the fonts, the font style of the Word document still changes to the default font "Arial" after converting to PDF. – Vinayak Mittal May 07 '21 at 07:37
  • @JasonPlutext how would you do that, do you have an example? – videomugre Aug 15 '22 at 19:09
  • https://github.com/plutext/docx4j/blob/VERSION_11_4_7/docx4j-samples-docx-export-fo/src/main/java/org/docx4j/samples/ConvertOutPDFviaXSLFO.java#L157 – JasonPlutext Aug 15 '22 at 20:13
  • Awesome, thank you, this helped too: https://www.programcreek.com/java-api-examples/?api=org.docx4j.fonts.PhysicalFonts – videomugre Aug 15 '22 at 20:38
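
To make the font-mapper suggestion concrete, here is a minimal sketch along the lines of the linked ConvertOutPDFviaXSLFO sample. It is an illustration rather than the original answer's code: the input/output paths and the substitute font name ("Liberation Sans") are assumptions, and recent docx4j versions need an FO exporter (e.g. the docx4j-export-fo artifact) on the classpath for Docx4J.toPDF.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import org.docx4j.Docx4J;
    import org.docx4j.fonts.IdentityPlusMapper;
    import org.docx4j.fonts.Mapper;
    import org.docx4j.fonts.PhysicalFont;
    import org.docx4j.fonts.PhysicalFonts;
    import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

    public class DocxToPdfWithFontMapper {

        public static void main(String[] args) throws Exception {
            // load the source document (path is an example)
            WordprocessingMLPackage wordMLPackage =
                    WordprocessingMLPackage.load(new File("Test.docx"));

            // IdentityPlusMapper keeps same-name mappings for fonts that are installed;
            // explicit entries redirect fonts that are missing on the server
            Mapper fontMapper = new IdentityPlusMapper();
            wordMLPackage.setFontMapper(fontMapper);

            PhysicalFont substitute = PhysicalFonts.get("Liberation Sans"); // assumed to be installed
            if (substitute != null) {
                fontMapper.put("Arial", substitute);
                fontMapper.put("Times New Roman", substitute);
            }

            // convert to PDF via XSL-FO
            try (OutputStream out = new FileOutputStream("Test.pdf")) {
                Docx4J.toPDF(wordMLPackage, out);
            }
        }
    }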
2

I had the same problem and found that, as you mentioned yourself, it is a font problem: the font on the system needs to support your encoding.

E.g. for documents using the "Arial" font, German umlaut characters are shown as "?".

I found another solution: override the PDF font encoding as follows:

    //
    // read template
    //
    File docxFile = new File(System.getProperty("user.dir") + "/" + "Test.docx");
    InputStream in = new FileInputStream(docxFile);

    // 
    // prepare document context
    //
    IXDocReport report = XDocReportRegistry.getRegistry().loadReport(in, TemplateEngineKind.Velocity);
    IContext context = report.createContext();
    context.put("name", "Michael Küfner");

    // 
    // generate PDF output
    //
    Options options = Options.getTo(ConverterTypeTo.PDF).via(ConverterTypeVia.XWPF);
    PdfOptions pdfOptions = PdfOptions.create();
    pdfOptions.fontEncoding("iso-8859-15");
    options.subOptions(pdfOptions);     


    OutputStream out = new FileOutputStream(new File(docxFile.getPath() + ".pdf"));
    report.convert(context, options, out);

Try adjusting the value passed to pdfOptions.fontEncoding (in my case "iso-8859-15") to your needs.

Setting this to "UTF-8", which seems to be the default, resulted in the same problem with special characters.

Another thing I found:

Using the "Calibri" font, which is default for Word 2007/2010, the problem did not occur, even when using UTF-8 encoding. Maybe the embedded Type-1 Arial Font in iText, which is used for generating PDFs, does not support UTF-8 encoding.

minni
  • Please pay attention with XDocReport 1.0.3 (not yet released), because we have done a big refactoring of font handling (the behaviour with Calibri could change). – Angelo Aug 29 '13 at 07:45
  • @minni As I wrote in my question, I decided to give up on XDocReport for converting docx -> pdf because of other issues, not encoding; XDocReport handles that well. The problem was with docx4j, but I found the solution described above. Anyway, thanks for the tip about Calibri – robson Aug 29 '13 at 11:38