Docx to Pdf with replaced characters

Question

I have a docx file containing Chinese characters and text in other Asian languages. On my laptop I can convert it to a PDF perfectly, with the Chinese characters embedded properly. But when the same code runs as a runnable JAR on the Linux server, the Chinese characters are replaced with the # symbol. Can someone please guide me with this problem? Thank you in advance for the help. The Java code is given below.

public static void main(String[] args) throws Exception {

    try {

        Docx4jProperties.getProperties().setProperty("docx4j.Log4j.Configurator.disabled", "true");
        Log4jConfigurator.configure();
        org.docx4j.convert.out.pdf.viaXSLFO.Conversion.log.setLevel(Level.OFF);

        System.out.println("Getting input Docx File");
        InputStream is = new FileInputStream(new File(
                "C:/Users/nithins/Documents/plugin docx to pdf/other documents/Contains Complex Fonts Verified.docx"));
        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(is);
        wordMLPackage.setFontMapper(new IdentityPlusMapper());

        System.out.println("Setting File Encoding");
        System.setProperty("file.encoding", "Identity-H");
        System.out.println("Generating PDF file");

        org.docx4j.convert.out.pdf.PdfConversion c = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(
                wordMLPackage);
        File outFile = new File(
                "C:/Users/nithins/Documents/plugin docx to pdf/other documents/Contains Complex Fonts Verified.pdf");
        OutputStream os = new FileOutputStream(outFile);
        c.output(os, new PdfSettings());
        os.close();

        System.out.println("Output pdf file generated");
    } catch (Exception e) {
        e.printStackTrace();
    }

}

public static String changeExtensionToPdf(String path) {
    int markerIndex = path.lastIndexOf(".docx");
    String pdfFile = path.substring(0, markerIndex) + ".pdf";
    return pdfFile;
}

You use a java solution for that docx to pdf conversion. That's all you tell us. So all we can say is that you seem to do something wrong in that solution. — mkl, May 04 '17 at 10:53
Ok, so you use [tag:docx4j]. I added that tag. Unfortunately I don't know that product at all. Just one remark: `System.setProperty("file.encoding", "Identity-H")` should not make any sense at all, **Identity-H** is a PDF internal thing; the system property "file.encoding" refers to text files in general, and, therefore, *not* to PDFs which after all are binary files, not text files. Furthermore, it is weird that you set the log level to off even though you still run into trouble, after all there might be log outputs which could help you. — mkl, May 04 '17 at 12:01

score 1 · Answer 1 · answered May 07 '17 at 11:06

Copied from docx4j's "Getting Started" documentation:

docx4j can only use fonts which are available to it.

These fonts come from 2 sources:
•   those installed on the computer
•   those embedded in the document
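
(Not part of the docx4j documentation.) Because the set of installed fonts differs between a Windows laptop and a Linux server, a quick first check is to list the font files that exist on the server at all. The sketch below uses only the Java standard library. The directories in `main` are an assumption (conventional locations on many Linux distributions), and docx4j's own font discovery may look in other places, so treat the output as a hint rather than as what docx4j will see.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

final class FontFileScanner {

    /** True if the file name ends in a common font extension (case-insensitive). */
    static boolean isFontFile(Path path) {
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".ttf") || name.endsWith(".ttc") || name.endsWith(".otf");
    }

    /** Recursively collects font files under the given roots; roots that are not directories are skipped. */
    static List<Path> findFontFiles(List<Path> roots) throws IOException {
        List<Path> found = new ArrayList<>();
        for (Path root : roots) {
            if (!Files.isDirectory(root)) {
                continue;
            }
            try (Stream<Path> walk = Files.walk(root)) {
                walk.filter(Files::isRegularFile)
                    .filter(FontFileScanner::isFontFile)
                    .forEach(found::add);
            }
        }
        Collections.sort(found);
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Assumed locations: conventional on many Linux distributions, not guaranteed.
        List<Path> roots = Arrays.asList(
                Paths.get("/usr/share/fonts"),
                Paths.get("/usr/local/share/fonts"),
                Paths.get(System.getProperty("user.home"), ".fonts"));
        for (Path font : findFontFiles(roots)) {
            System.out.println(font);
        }
    }
}
```

If the list contains no font that covers Chinese, installing one on the server (or embedding it in the document) is the likely fix.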

Note that Word silently performs font substitution.  When you open an existing document in 
Word, and select text in a particular font, the actual font you see on the screen won't be 
the font reported in the ribbon if it is not installed on your computer or embedded in the 
document.  To see whether Word 2007 is substituting a font, go into Word Options 
> Advanced > Show Document Content and press the "Font Substitution" button.  

Word's font substitution information is not available to docx4j.  As a developer, you have 3
options:
•   ensure the font is installed or embedded
•   tell docx4j which font to use instead, or
•   allow docx4j to fall back to a default font
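
(Not part of the docx4j documentation.) The three options amount to a lookup order: use the requested font if it is available, otherwise use an explicit mapping, otherwise fall back to a default. The following is a conceptual sketch of that order in plain Java. It is not docx4j's actual algorithm or API; the class, the method, and the font names are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class FontResolutionSketch {

    /**
     * Picks a font name in this order: the requested font if available,
     * else its mapped substitute if that is available, else the fallback.
     */
    static String resolve(String requested, Set<String> available,
                          Map<String, String> mappings, String fallback) {
        if (available.contains(requested)) {
            return requested;
        }
        String mapped = mappings.get(requested);
        if (mapped != null && available.contains(mapped)) {
            return mapped;
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Hypothetical set of fonts present on a server.
        Set<String> available = new HashSet<>(Arrays.asList("DejaVu Sans", "Noto Sans CJK SC"));
        // Hypothetical mapping from a font named in the document to an installed one.
        Map<String, String> mappings = new HashMap<>();
        mappings.put("SimSun", "Noto Sans CJK SC");

        System.out.println(resolve("DejaVu Sans", available, mappings, "DejaVu Sans")); // DejaVu Sans
        System.out.println(resolve("SimSun", available, mappings, "DejaVu Sans"));      // Noto Sans CJK SC
        System.out.println(resolve("Calibri", available, mappings, "DejaVu Sans"));     // DejaVu Sans
    }
}
```

The last case is the one that bites with Chinese text: a default font with no CJK glyphs still "resolves", but the characters come out as replacement symbols.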

To embed a font in a document, open it in Word on a computer which has the font installed 
(check no substitution is occurring), and go to Word Options > Save > Embed Fonts in File.

If you want to tell docx4j to use a different font, you need to add a font mapping.  The 
FontMapper interface is used to do this.

On a Windows computer, font names for installed fonts are mapped 1:1 to the corresponding 
physical fonts via the IdentityPlusMapper. 

A font mapper contains Map<String, PhysicalFont>; to add a font mapping, as per the example in the ConvertOutPDF sample:
    // Set up font mapper
    Mapper fontMapper = new IdentityPlusMapper();
    wordMLPackage.setFontMapper(fontMapper);

    // Example: Times New Roman lacks certain Arabic glyphs, e.g.
    //   Glyph "ي" (0x64a, afii57450) not available in font "TimesNewRomanPS-ItalicMT".
    //   Glyph "ج" (0x62c, afii57420) not available in font "TimesNewRomanPS-ItalicMT".
    // so map it to a font which does have them.
    // Make sure this font is matched by your font regex (if any)!
    PhysicalFont font = PhysicalFonts.get("Arial Unicode MS");
    if (font!=null) {
        fontMapper.put("Times New Roman", font);
        fontMapper.put("Arial", font);
    }

You'll see the font names if you configure log4j debug level logging for
 org.docx4j.fonts.PhysicalFonts

If you turn logging on for org.docx4j.fonts, it should tell you about the missing glyphs. See https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/fonts/GlyphCheck.java
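
Independently of docx4j's logging, it can help to know which scripts the document text actually uses, so you know what kind of font to install or map to. A minimal sketch using only the standard `Character.UnicodeScript` API follows. The class and method names are hypothetical, and it inspects a plain string rather than a docx file, so extracting the text from the document is left out.

```java
import java.util.EnumSet;
import java.util.Set;

final class ScriptInventory {

    /** Returns the Unicode scripts used in the text, ignoring script-neutral characters. */
    static Set<Character.UnicodeScript> scriptsIn(String text) {
        Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            // COMMON (spaces, digits, punctuation), INHERITED and UNKNOWN say nothing about font needs.
            if (script != Character.UnicodeScript.COMMON
                    && script != Character.UnicodeScript.INHERITED
                    && script != Character.UnicodeScript.UNKNOWN) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source ASCII-only: U+4F60 U+597D is Chinese, U+062C is Arabic.
        System.out.println(scriptsIn("Hello \u4f60\u597d \u062c"));
    }
}
```

A result containing `HAN` means the server needs at least one font with CJK glyphs, either installed, embedded, or mapped via the font mapper.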
