ă ș ț characters missing from pdf generated from html with PdfWriter

Question

I am trying to convert some HTML content to a PDF using the iText PdfWriter, like this:

Document document = new Document();
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
document.open();
InputStream stream = new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
XMLWorkerHelper.getInstance().parseXHtml(writer, document, stream, Charset.forName("UTF-8"));
document.close();

but the ă ș ț characters are missing from the generated PDF. I have tried setting the encoding and the font, with no luck. In both cases I used a font provider and passed it as a parameter to the parseXHtml method.

I set the encoding, but nothing changed.

XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider();
fontProvider.setUseUnicode(true);
fontProvider.defaultEncoding = BaseFont.CP1257;
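
One thing that can be checked without iText: assuming `BaseFont.CP1257` refers to the Windows-1257 (Baltic) code page, that code page may simply have no slot for ă ș ț, in which case choosing it as the default encoding would not be expected to help. The sketch below uses only the JDK; the class and method names (`CodePageCheck`, `unencodable`) are made up for illustration, and it assumes the JDK in use ships the `windows-1257` charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Illustrative helper, not part of iText.
class CodePageCheck {

    /** Returns the characters of text that the named charset cannot encode. */
    static String unencodable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder missing = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (!encoder.canEncode(c)) {
                missing.append(c);
            }
        }
        return missing.toString();
    }

    public static void main(String[] args) {
        // ă (U+0103), ș (U+0219), ț (U+021B), escaped so the source file encoding does not matter
        String romanian = "\u0103\u0219\u021B";
        for (String name : new String[] {"windows-1257", "UTF-8"}) {
            StringBuilder report = new StringBuilder();
            for (char c : unencodable(romanian, name).toCharArray()) {
                report.append(String.format("U+%04X ", (int) c));
            }
            System.out.println(name + " cannot encode: " + report.toString().trim());
        }
    }
}
```

If the first line of output lists all three code points, the single-byte encoding itself is a dead end for these characters, independent of which font is registered.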

I also tried setting the font, but it was not applied to the pdf.

XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
fontProvider.register(PATH_TO_TTF_FONT_FILE_HOSTED_ON_S3);

I then passed the font provider to parseXHtml:

XMLWorkerHelper.getInstance().parseXHtml(writer, document, stream, Charset.forName("UTF-8"), fontProvider);

Is there any way I could use the PdfWriter to convert all characters correctly from html to pdf?

UTF-8, that you are using, don't have those characters, try UTF-16. — res, Mar 10 '20 at 08:29
@res: ... no, UTF-8 is perfectly capable of encoding all *possible* characters in Unicode. `ă`, for example, is encoded as [0xC4 0x83](https://www.fileformat.info/info/unicode/char/0103/index.htm). It is far more likely that the font in use does not *have* those characters. — Jongware, Mar 10 '20 at 09:28
@res actually this page itself is encoded in UTF-8 (at least for me) and I can see the symbols.... — user85421, Mar 10 '20 at 09:36
@aniri could you please share the HTML or abstract version of it so we could work on it? — shihabudheenk, Mar 11 '20 at 02:38
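
Jongware's point about UTF-8 can be verified with the JDK alone. The following is a minimal sketch (the `Utf8Bytes` class and `utf8Hex` method are illustrative names, not from the question) that prints the UTF-8 bytes of the three characters and confirms they survive the same `getBytes(StandardCharsets.UTF_8)` step used in the question's code.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper, not part of iText.
class Utf8Bytes {

    /** Returns the UTF-8 encoding of s as space-separated uppercase hex bytes. */
    static String utf8Hex(String s) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u0103")); // ă -> C4 83
        System.out.println(utf8Hex("\u0219")); // ș -> C8 99
        System.out.println(utf8Hex("\u021B")); // ț -> C8 9B

        // Round trip: encode to UTF-8 and decode again, as the input stream in the question does.
        String original = "\u0103\u0219\u021B";
        String decoded = new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println(original.equals(decoded)); // true
    }
}
```

Since the bytes round-trip intact, the characters are most likely lost later in the pipeline, which is consistent with the suggestion that the font in use lacks these glyphs.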


0 Answers