Transforming an Arabic PDF to an image without losing data

Question

I'm trying to convert a PDF containing Arabic characters to an image to use as a thumbnail on my web page. Neither PDFrenderer nor PDFBox preserved the Arabic characters after conversion. I only got a satisfactory result with JMagick, but it relies on a DLL and some other dependencies that I am not allowed to add to my application installer.

Are there better open-source solutions I may have missed? Failing that, what are the best commercial solutions out there?

Thanks.

Here's my mock PDF file:

pdf file

Please provide a sample PDF file as used by you. Maybe there is some peculiarity about the PDF which should be fixed in a pre-processing step before transformation to image. — mkl, May 22 '13 at 11:14
ImageMagick's convert result.pdf result.png works for me on Linux (there is also a Windows version, convert.exe; just don't confuse it with Windows' own convert.exe). — Alvin K., May 23 '13 at 07:52
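
For readers who want to try the command-line route from Java, here is a minimal sketch using the standard `ProcessBuilder` API. It assumes ImageMagick is installed separately (for PDF input it typically also relies on Ghostscript), so it does not remove the external dependency the question is worried about. The class name `ConvertCliSketch` and its methods are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class ConvertCliSketch {

    // Builds the argument list for: <convertExe> <pdf> <png>
    // The executable is passed explicitly so that, on Windows, the full path
    // to ImageMagick's convert.exe can be given instead of relying on PATH
    // (where Windows' own convert.exe might be found first).
    static List<String> buildCommand(String convertExe, Path pdf, Path png) {
        return Arrays.asList(convertExe, pdf.toString(), png.toString());
    }

    // Starts the external tool, waits for it to finish and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward the tool's output to this JVM's console
                .start();
        return process.waitFor();
    }
}
```

A call such as `ConvertCliSketch.run(ConvertCliSketch.buildCommand("convert", Paths.get("result.pdf"), Paths.get("result.png")))` would then mirror the command from the comment above.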

score 0 · Answer 1 · answered May 23 '13 at 08:30

(I post this as an answer because it's too long for a comment, even though it merely is an analysis of the given sample PDF)

There actually are at least two issues when PDFBox tries to render your sample PDF.

Issue 1: For all Latin letters and all numbers, the original font is replaced by a default font. Cf. log output like this:
```
23.05.2013 09:15:48 org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
WARNUNG: Changing font on <H> from <Arial> to the default font
```
This is due to PDFBox's way of rendering text in combination with the very limited information available for the embedded font.

PDFBox makes use of the JRE's text rendering capabilities in a way that requires first transforming the text information to Unicode and then rendering these Unicode characters. The embedded font does not include any encoding or mapping information, though.

Transforming to Unicode accidentally succeeds because PDFBox uses a fallback which simply assumes some default encoding. Rendering fails, though, as the JRE code does not have any information which glyph to use for which Unicode character.
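
To illustrate why the round trip can lose information, here is a toy model in Java. It is not PDFBox code: `ToyFont`, `renderDirect` and `renderViaUnicode` are hypothetical names, and the Latin-1 style fallback is only an assumption standing in for "some default encoding". The point is merely that a font which can draw a glyph for a character code may still have no entry for the Unicode character that the code was translated to.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class GlyphRoundTripSketch {

    // Toy stand-in for an embedded font: glyphs are addressed by character
    // code; the Unicode-to-glyph table stays empty when the font carries no
    // encoding or mapping information.
    static final class ToyFont {
        final Map<Integer, String> glyphByCode = new HashMap<>();
        final Map<Character, String> glyphByUnicode = new HashMap<>();
    }

    // Direct route: character code -> glyph.
    static Optional<String> renderDirect(ToyFont font, int code) {
        return Optional.ofNullable(font.glyphByCode.get(code));
    }

    // Round trip: character code -> Unicode (assumed fallback) -> glyph.
    static Optional<String> renderViaUnicode(ToyFont font, int code) {
        char unicode = (char) code; // assumed fallback: code taken as the Unicode value
        return Optional.ofNullable(font.glyphByUnicode.get(unicode));
    }

    public static void main(String[] args) {
        ToyFont font = new ToyFont();
        font.glyphByCode.put(0x48, "glyph-H"); // the font can draw code 0x48
        // No Unicode mapping is registered, mimicking the missing information.
        System.out.println("direct:      " + renderDirect(font, 0x48));
        System.out.println("via Unicode: " + renderViaUnicode(font, 0x48));
    }
}
```

In this model the code-to-Unicode step "succeeds" (0x48 becomes `H`), but the lookup back to a glyph finds nothing, which is the kind of situation in which a renderer has to fall back to another font.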

Issue 2: For all Arabic text, the embedded font cannot be read and, therefore, Arial is used instead:
```
23.05.2013 09:15:48 org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font getawtFont
INFO: Can't read the embedded font HYMDAA+ArialMT-Identity-H
23.05.2013 09:15:58 org.apache.pdfbox.pdmodel.font.PDType0Font getawtFont
INFO: Using font Arial instead of HYMDAA+ArialMT-Identity-H
```

Here, parsing the embedded font already fails. Internally, an exception is thrown by the JRE code:
```
java.awt.FontFormatException: Font name not found
    at sun.font.TrueTypeFont.init(TrueTypeFont.java:527)
    at sun.font.TrueTypeFont.<init>(TrueTypeFont.java:162)
    at sun.font.FontManager.createFont2D(FontManager.java:2474)
    at java.awt.Font.<init>(Font.java:570)
    at java.awt.Font.createFont(Font.java:896)
    at org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font.getawtFont(PDCIDFontType2Font.java:81)
    ...
```

I'm not very knowledgeable concerning font internals and, therefore, do not know whether the JRE code is somewhat over-sensitive here or whether the embedded font is really broken. It seems to be fishy, though.
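
One way to narrow this down, sketched here under assumptions: the message "Font name not found" suggests that the JRE could not find a usable font name, which would be the case if the font's `name` table is missing or empty. That is a guess and has not been checked against the sample file. The hypothetical helper below lists the table tags in the table directory of a TrueType/OpenType (sfnt) font program, assuming the embedded font bytes have already been extracted from the PDF by some other means.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

class SfntTableLister {

    // Reads the table tags from the table directory at the start of an sfnt
    // font program: a 12-byte header followed by one 16-byte record per table.
    static Set<String> tableTags(byte[] fontData) {
        if (fontData.length < 12) {
            throw new IllegalArgumentException("too short for an sfnt header");
        }
        ByteBuffer buf = ByteBuffer.wrap(fontData); // big-endian by default
        buf.getInt();                               // sfnt version
        int numTables = buf.getShort() & 0xFFFF;    // unsigned 16-bit table count
        buf.position(12);                           // skip searchRange, entrySelector, rangeShift
        Set<String> tags = new LinkedHashSet<>();
        byte[] tag = new byte[4];
        for (int i = 0; i < numTables; i++) {
            if (buf.remaining() < 16) {
                throw new IllegalArgumentException("truncated table directory");
            }
            buf.get(tag);                           // 4-byte ASCII tag, e.g. "name" or "cmap"
            tags.add(new String(tag, StandardCharsets.US_ASCII));
            buf.position(buf.position() + 12);      // skip checksum, offset, length
        }
        return tags;
    }

    // Convenience check for a single table tag.
    static boolean hasTable(byte[] fontData, String tag) {
        return tableTags(fontData).contains(tag);
    }
}
```

If `hasTable(fontBytes, "name")` returned false for the extracted font, that would point to incomplete embedded font data rather than an over-sensitive JRE; a present table would not settle the question either way, since its contents could still be unusable.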

Issue 1 looks like a shortcoming of PDFBox (expecting to be able to do the round trip from glyph to Unicode and back to glyph without loss is quite naive in the world of PDF). Other renderers using a less naive approach are, therefore, quite likely to succeed in properly displaying the text affected by this issue.

Issue 2, on the other hand, might turn out to be a hindrance for many renderers.

I would suggest trying to tweak the PDF creation process to include more complete font information.

score 0 · Answer 2 · answered May 23 '13 at 10:35

ABCpdf .NET will do this type of conversion.

It supports all those normally-unsupported features like Arabic, Type 3 fonts, gradients, unusual color spaces, spot colors and PostScript functions.

Here's your PDF converted to PNG using ABCpdf .NET. [image]

I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)

OnceUponATimeInTheWest

He asked about Java :). – Yassering, Nov 10 '13 at 11:24
