I'm trying to convert a PDF containing Arabic characters to an image to use as a thumbnail on my web page. Neither PDFRenderer nor PDFBox preserved the Arabic characters in the conversion. I only got a satisfying result with JMagick, but it relies on a DLL and other dependencies that I am not allowed to add to my application installer.

Is there a better open source solution I may have missed? Failing that, what are the best commercial solutions out there?

thanks

Here's my mock PDF file:

pdf file

Gopakumar N G
Genjuro
  • Please provide the sample PDF file you used. There may be some peculiarity of the PDF that should be fixed in a pre-processing step before conversion to an image. – mkl May 22 '13 at 11:14
  • I attached the PDF file I'm using. – Genjuro May 22 '13 at 13:02
  • ImageMagick works for me on Linux: convert result.pdf result.png (there is also a Windows version, convert.exe; just don't confuse it with Windows's own convert.exe). – Alvin K. May 23 '13 at 07:52

2 Answers

(I am posting this as an answer because it is too long for a comment, even though it is merely an analysis of the given sample PDF.)

There actually are at least two issues when PDFBox tries to render your sample PDF.

  1. For all Latin letters and all numbers, the original font is replaced by a default font. Cf. the log outputs like this:

    23.05.2013 09:15:48 org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
    WARNUNG: Changing font on <H> from <Arial> to the default font
    

    This is due to PDFBox's way of rendering text in combination with the very limited information available for the embedded font.

    PDFBox uses the JRE's text rendering capabilities in a way that requires first converting the text information to Unicode and then rendering those Unicode characters. The embedded font does not include any encoding or mapping information, though.

    The conversion to Unicode happens to succeed because PDFBox uses a fallback which simply assumes some default encoding. Rendering fails, though, as the JRE code has no information about which glyph to use for which Unicode character.

  2. For all Arabic text, the embedded font cannot be read and, therefore, Arial is used instead:

    23.05.2013 09:15:48 org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font getawtFont
    INFO: Can't read the embedded font HYMDAA+ArialMT-Identity-H
    23.05.2013 09:15:58 org.apache.pdfbox.pdmodel.font.PDType0Font getawtFont
    INFO: Using font Arial instead of HYMDAA+ArialMT-Identity-H
    

    Here, parsing the embedded font already fails. Internally, an exception is thrown by the JRE code:

    java.awt.FontFormatException: Font name not found
        at sun.font.TrueTypeFont.init(TrueTypeFont.java:527)
        at sun.font.TrueTypeFont.<init>(TrueTypeFont.java:162)
        at sun.font.FontManager.createFont2D(FontManager.java:2474)
        at java.awt.Font.<init>(Font.java:570)
        at java.awt.Font.createFont(Font.java:896)
        at org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font.getawtFont(PDCIDFontType2Font.java:81)
        ...
    

    I'm not very knowledgeable concerning font internals and, therefore, do not know whether the JRE code is somewhat over-sensitive here or whether the embedded font is really broken. It seems to be fishy, though.
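One way to look into this further might be to list the tables of the embedded font program and see whether a name table is present at all. The following is only a minimal sketch: it assumes the font bytes have already been extracted from the PDF (extraction is not shown), it assumes the exception is related to missing or incomplete naming data, which the message suggests but the trace alone does not prove, and the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SfntTables {
    /**
     * Returns the four-character table tags listed in the table directory
     * of an sfnt (TrueType/OpenType) font program.
     */
    static List<String> listTables(byte[] font) {
        if (font.length < 12) {
            throw new IllegalArgumentException("too short for an sfnt header");
        }
        ByteBuffer buf = ByteBuffer.wrap(font);   // big-endian by default
        buf.getInt();                             // sfnt version, ignored here
        int numTables = buf.getShort() & 0xFFFF;  // unsigned 16-bit count
        if (font.length < 12 + 16L * numTables) {
            throw new IllegalArgumentException("truncated table directory");
        }
        buf.position(12);                         // skip searchRange, entrySelector, rangeShift
        List<String> tags = new ArrayList<>();
        byte[] tag = new byte[4];
        for (int i = 0; i < numTables; i++) {
            buf.get(tag);                         // 4-byte table tag, e.g. "glyf"
            tags.add(new String(tag, StandardCharsets.US_ASCII));
            buf.position(buf.position() + 12);    // skip checksum, offset, length
        }
        return tags;
    }
}
```

If `SfntTables.listTables(fontBytes)` does not contain `"name"`, that would at least be consistent with the "Font name not found" message above.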

Issue 1 looks like a shortcoming of PDFBox (expecting to be able to do the round trip from glyph to Unicode and back to glyph without loss is quite naive in the world of PDF). Other renderers using a less naive approach are, therefore, quite likely to display the text affected by this issue properly.
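To make the lossy round trip behind issue 1 more concrete, here is a toy model. It is not PDFBox's actual code; all names and the two maps are hypothetical stand-ins for "the encoding information in the PDF" and "the Unicode-to-glyph mapping in the font". It only illustrates that step 1 can succeed through a fallback while step 2 still has nothing to look up.

```java
import java.util.Map;
import java.util.Optional;

public class RoundTripSketch {
    /**
     * Step 1: character code to Unicode. If the PDF supplies no encoding
     * entry, fall back to guessing that the code is the Unicode value.
     */
    static char toUnicode(int code, Map<Integer, Character> encoding) {
        Character mapped = encoding.get(code);
        return mapped != null ? mapped : (char) code; // fallback guess
    }

    /**
     * Step 2: Unicode to glyph id. This needs a Unicode mapping for the
     * font; without one there is no glyph to pick.
     */
    static Optional<Integer> toGlyph(char unicode, Map<Character, Integer> unicodeToGlyph) {
        return Optional.ofNullable(unicodeToGlyph.get(unicode));
    }
}
```

With two empty maps, `toUnicode(0x48, ...)` still yields `'H'`, but `toGlyph('H', ...)` comes back empty, which corresponds to the point where a renderer would have to switch to a default font.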

Issue 2, on the other hand, might turn out to be a hindrance for many renderers.

I would suggest trying to tweak the PDF creation process to include more complete font information.

mkl

ABCpdf .NET will do this type of conversion.

It supports all those normally-unsupported features like Arabic, Type 3 fonts, gradients, unusual color spaces, spot colors and PostScript functions.

Here's your PDF converted to PNG using ABCpdf .NET.

I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)