
I have a PDF file compiled with LaTeX (using the XeLaTeX engine so that I can use an Arabic font). I want to upload it to the web and prevent copying and pasting of its content.

Since I am looking for free software to do that, I came across two powerful tools for the job, namely ImageMagick and Ghostscript. All I need is to convert one text PDF to an image PDF in one go, preferably with batch processing if possible (to convert many PDFs in one go).

I run this command on the command line and it works fine for PDFs written in English:

convert someenglish.pdf output.pdf  

Now when I do the same for an Arabic PDF I get this error:

convert.exe: PDFDelegateFailed `[ghostscript library] -q -dQUIET -dSAFER -dBATCH
 -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sD
EVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72"  "-sOutputFile
=C:/Users/doctorate/AppData/Local/Temp/magick-65203BNMxTDhXtkF%d" "-fC:/Users/doctorate/Ap
pData/Local/Temp/magick-65206AK54hOoKA62" "-fC:/Users/doctorate/AppData/Local/Temp/ma
gick-6520hDn-KMyTyxy2"':    **** Error reading a content stream. The page may be
 incomplete.
   **** Incorrect object count in object stream.
Error: /rangecheck in resolveobjectstream
Operand stack:
   78424   10   1   10   --dict:7/15(L)--   26   --nostringval--   35   --nostri
ngval--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--
  --dict:4/4(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict
:4/4(L)--   --dict:3/3(L)--   --dict:2/2(L)--   --nostringval--   --dict:7/7(L)-
-   --dict:10/10(L)--   --nostringval--   --nostringval--   Type   Font   Subtyp
e   CIDFontType2   BaseFont   MYCROL+(AH
Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval-
-   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   fa
lse   1   %stopped_push   1983   1   3   %oparray_pop   1982   1   3   %oparray_
pop   1966   1   3   %oparray_pop   --nostringval--   --nostringval--   --nostri
ngval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--
  --nostringval--   --nostringval--
Dictionary stack:
   --dict:1193/1684(ro)(G)--   --dict:1/20(G)--   --dict:82/200(L)--   --dict:82
/200(L)--   --dict:116/127(ro)(G)--   --dict:280/300(ro)(G)--   --dict:24/32(L)-
-
Current allocation mode is local
GPL Ghostscript 9.15: Unrecoverable error, exit code 1
 @ error/pdf.c/InvokePDFDelegate/263.
convert.exe: no images defined `test.pdf' @ error/convert.c/ConvertImageCommand/
3210.

Question
What am I missing here? I am not a programmer, so please take this into account in your answer. I would be very grateful if you could show how to do this as a batch process.

Notes

  • Windows 7 32bit

  • Ghostscript version 9.15

  • Image quality is not an issue for me; even 72 dpi will be fine

  • I want to strike a balance between the size of the output and the clarity of the text. I just want the text to be readable on the web, not to run OCR on it, so the image doesn't need to be very sharp. The size of the output is more important (the smaller the better), and honestly I am clueless as to which works better in this case: converting the PDF to PNG or to JPEG.

  • I don't want to burst a PDF into multiple serially named PNGs or JPEGs; I simply want one PDF converted to another PDF, with images inside and no more copy-and-paste-prone text.

Update
I tried to make a minimal working example PDF to mimic the original PDF and found that the problem arises when a certain Arabic font called (AH) Manal Black is included. Running pdffonts from the command line on this MWE PDF gives:

Syntax Error (18062): Illegal character ')'
Syntax Error (18076): Dictionary key must be a name object
Syntax Error (18085): Dictionary key must be a name object
Syntax Error (18248): Illegal character ')'
Syntax Error (18248): Dictionary key must be a name object
Syntax Error (18253): Dictionary key must be a name object
Syntax Error (18599): Illegal character ')'
Syntax Error (18599): Dictionary key must be a name object
Syntax Error (18607): Dictionary key must be a name object
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
GAKHDJ+(AH                           CID TrueType      yes yes yes      5  0
HTCSVQ+Amiri-Regular                 CID TrueType      yes yes yes      7  0

When I exclude this Arabic font while compiling the document with the XeLaTeX engine, the convert command works just fine, as it does for English PDFs. So the problem is most probably linked to how the fonts are parsed.


Update: A minimal working example is here: https://www.dropbox.com/s/qdeuzips0ivas4q/mwe_ar.pdf?dl=0

Kurt Pfeifle
doctorate
  • Ghostscript is telling you there's a problem with your PDF file. It tried to recover from that problem, but it was too serious and it gave up. There's no way to tell what's wrong with the PDF file (or even whether Ghostscript is incorrect about there being a problem) without seeing the PDF. – KenS May 01 '15 at 09:41
  • I provided a minimal working example pdf in this link: https://www.dropbox.com/s/qdeuzips0ivas4q/mwe_ar.pdf?dl=0 – doctorate May 01 '15 at 10:16
  • Well, all I can tell you is that something is still wrong. It 'looks like' the font has been embedded with an illegal character which hasn't been escaped. If you can put a simple example somewhere, I can look at it, but the fact that pdffonts also tells you there's a problem is pretty strong support that there's a fault. You should probably open a bug report for LaTeX (or whatever it is you use to make the PDF file) – KenS May 01 '15 at 13:01
  • Actually, it looks like the problem is that the font name contains parentheses (). I doubt that's legal; they probably need to be represented as escapes, i.e. #xx hexadecimal values, instead of as a literal ( or ) – KenS May 01 '15 at 13:03
  • A minimal example would only make sense if I provided the source of that font, which, unfortunately, I don't have. But I will try to see what happens when the parentheses are removed from the font name. Thanks for this tip. – doctorate May 01 '15 at 13:05
  • @doctorate: Sorry, I just discovered it :) and put it into the body of your question. Sorry, I had the web page with your question open since breakfast time, but got distracted. But I forgot to update the web page, and hence the new comments weren't visible to me... – Kurt Pfeifle May 01 '15 at 15:24

2 Answers


The minimal working example has PDF object no. 10 as an ObjStm (object stream), where this part can be found (I edited the whitespace formatting for improved readability):

<<  /Type               /Font
    /Subtype            /Type0
    /BaseFont           /GAKHDJ+#28AH)#20Manal#20Black
    /Encoding           /Identity-H
    /DescendantFonts    [4 0 R]
    /ToUnicode          12 0 R
>>

So in the font name, (AH) Manal Black, the blanks are properly hex-escaped as #20 and the opening parenthesis ( as #28, but the closing parenthesis ) is not hex-escaped as #29, as it should be.
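
As an illustration of the escaping rule at issue, here is a minimal Python sketch of how a PDF name could be written out. It assumes the rule from the PDF specification that any byte which is not a regular printable ASCII character (that is, whitespace, delimiters such as ( and ), and the number sign itself) is written as # followed by its two-digit hex code. The function name `escape_pdf_name` is hypothetical and is not taken from xdvipdfmx or any other tool.

```python
# Delimiter characters defined by the PDF syntax; they may not appear
# literally inside a name object.
PDF_DELIMITERS = b"()<>[]{}/%"


def escape_pdf_name(name: str) -> str:
    """Return `name` in PDF name syntax (without the leading slash).

    Hypothetical helper: bytes outside the printable range '!'..'~',
    delimiter characters, and '#' itself become '#' plus two hex digits.
    """
    out = []
    for byte in name.encode("utf-8"):
        is_printable = 0x21 <= byte <= 0x7E
        if is_printable and byte not in PDF_DELIMITERS and byte != ord("#"):
            out.append(chr(byte))  # regular character, written as itself
        else:
            out.append(f"#{byte:02X}")  # everything else is hex-escaped
    return "".join(out)


print(escape_pdf_name("(AH) Manal Black"))
# prints: #28AH#29#20Manal#20Black
```

Compared with the `/BaseFont` value found in the file, the only difference is the `#29` for the closing parenthesis.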

Without knowing more about the PDF generating process, I guess that the Creator/Producer combo as given through the file's metadata,

Creator:    XeTeX output 2015.05.01:1207
Producer:   xdvipdfmx (20140317)

is to blame. This is a bug in the PDF-generating software...


Update

Maybe I should reveal how I dissected and uncompressed the MWE PDF:

  1. Trying it with QPDF didn't work:

    qpdf --qdf --object-streams=disable mwe_ar.pdf qdf.pdf
    
     object stream 10 (file position 585): unexpected )
    
  2. Trying it with pdftk didn't work either:

    pdftk mwe_ar.pdf cat pdftk.pdf uncompress
    
     Error: Unable to find file.
     Error: Failed to open PDF file: 
        mwe_ar.pdf
     Errors encountered.  No output created.
     Done.  Input errors, so no output created.
    
  3. Trying it with MuPDF's mutool also failed:

    mutool clean -d mwe_ar.pdf mutool.pdf
    
     warning: lexical error (unexpected ')')
     error: invalid key in dict
     error: cannot parse dict
     error: cannot open object stream (10 0 R)
     error: cannot load object stream containing object (1 0 R)
     warning: cannot load object (1 0 R) into cache
     warning: lexical error (unexpected ')')
     error: invalid key in dict
     error: cannot parse dict
     error: cannot open object stream (10 0 R)
     error: cannot load object stream containing object (4 0 R)
     error: cannot load object (4 0 R) into cache
    
  4. Finally, as a last resort, PeePDF.py to the rescue:

    $ cat peepdf-commands.txt
    
     object 10
    
    $ peepdf.py -s peepdf-commands.txt
    
      << /Length 1000
      /N 13
      /Type /ObjStm
      /Filter /FlateDecode
      /First 84 >>
      stream
      9 0 3 72 11 133 2 197 1 312 15 343 4 446 14 625 19 876 6 1344 18 1514 5 1758 7 1886 <</Font<</F1 5 0 R/F2 7 0 R>>/ProcSet[/PDF/Text/ImageC/ImageB/ImageI]>>
      <</Resources 9 0 R/Type/Page/Parent 11 0 R/Contents[8 0 R]>>
      <</Type/Pages/Count 1/Kids[3 0 R]/MediaBox[0 0 595.28 841.89]>>
      <</Creator( XeTeX output 2015.05.01:1207)/Producer(xdvipdfmx \(20140317\))/CreationDate(D:20150501120749+01'00')>>
      <</Pages 11 0 R/Type/Catalog>>
      [417[251]421[257]424[368]443[470]445[355]450[380]480[322]498[480 233]505[461]508[256]514[326]520[264]]
      <</Type/Font/Subtype/CIDFontType2/BaseFont/GAKHDJ+#28AH)#20Manal#20Black/FontDescriptor 14 0 R/CIDSystemInfo<</Registry(Adobe)/Ordering(Identity)/Supplement 0>>/DW 199/W 15 0 R>>
      <</Type/FontDescriptor/Ascent 529/Descent -415/StemV 109/CapHeight 529/AvgWidth 392/FontBBox[-112 -321 1006 1137]/ItalicAngle 0/Flags 6/Style<</Panose<000000000000000000000000>>>/FontName/GAKHDJ+#28AH)#20Manal#20Black/FontFile2 16 0 R/CIDSet 17 0 R>>
      [39[693]41[522]51[535]108[415]124[415]388[218 926]402[1213]406[541]446[586]1886[317]1992[229]2016[366]2021[366]2105[244]2108[244]2139[1006]2150[295]2162[378]2227[379 452]2272[589]2294[176]2300[198]2308[389]2339[343]2356[723]2359[1079]2397[552]2413[346]2457[177]2491[299]2912[349]2952[219]2969[209]2973[148]2976[302]2981[341]3027[168]3149[550]3297[259]3325[292]3726[248]3732[319]3853[411]3893[179]4021[55]4323[104]4627[560]5068[238]5106[476]5322[159]5328[222]6366[93]]
      <</Type/Font/Subtype/CIDFontType2/BaseFont/HTCSVQ+Amiri-Regular/FontDescriptor 18 0 R/CIDSystemInfo<</Registry(Adobe)/Ordering(Identity)/Supplement 0>>/DW 190/W 19 0 R>>
      <</Type/FontDescriptor/Ascent 1123/Descent -635/StemV 87/CapHeight 1123/AvgWidth 685/FontBBox[-581 -900 11467 1815]/ItalicAngle 0/Flags 6/Style<</Panose<000000000500000000000000>>>/FontName/HTCSVQ+Amiri-Regular/FontFile2 20 0 R/CIDSet 21 0 R>>
      <</Type/Font/Subtype/Type0/BaseFont/GAKHDJ+#28AH)#20Manal#20Black/Encoding/Identity-H/DescendantFonts[4 0 R]/ToUnicode 12 0 R>>
      <</Type/Font/Subtype/Type0/BaseFont/HTCSVQ+Amiri-Regular/Encoding/Identity-H/DescendantFonts[6 0 R]/ToUnicode 13 0 R>>
    
      endstream
    

The more often I use PeePDF.py, the more I love it. Thanks, Jose Miguel, for this wonderful tool!
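
For readers curious about what the dump above contains: the decoded data of an object stream begins with /N pairs of integers (object number, byte offset), and the objects themselves start at byte /First, with the offsets counted from there. The following Python sketch splits an already-decoded stream along those lines. It is only an illustration: `split_object_stream` is a hypothetical name, the sample data is made up, and a real file needs a proper PDF parser around it to locate and decode the stream.

```python
import zlib


def split_object_stream(decoded: bytes, n: int, first: int) -> dict[int, bytes]:
    """Split decoded object-stream data into {object number: raw bytes}.

    `n` and `first` correspond to the /N and /First entries of the
    stream dictionary. Objects are cut at the recorded offsets only;
    their contents are not tokenized or validated.
    """
    header = decoded[:first].split()
    if len(header) < 2 * n:
        raise ValueError("header holds fewer than /N (number, offset) pairs")
    pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]
    body = decoded[first:]
    ordered = sorted(pairs, key=lambda pair: pair[1])
    objects = {}
    for idx, (number, offset) in enumerate(ordered):
        # An object ends where the next one starts, or at the end of the data.
        end = ordered[idx + 1][1] if idx + 1 < len(ordered) else len(body)
        objects[number] = body[offset:end].strip()
    return objects


# Made-up two-object stream, compressed the way /FlateDecode data is stored.
header = b"5 0 7 9 "
body = b"<</A 1>>\n[1 2 3]"
compressed = zlib.compress(header + body)

objects = split_object_stream(zlib.decompress(compressed), n=2, first=len(header))
print(objects)
# prints: {5: b'<</A 1>>', 7: b'[1 2 3]'}
```

In the dump above, the header line lists 13 such pairs (`9 0`, `3 72`, `11 133`, ...), matching `/N 13`.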

Kurt Pfeifle
  • Thanks, but using a different font caused the problem to disappear. So could it be that XeTeX could not fall back to a default font for some Unicode characters that (AH) Manal Black failed to provide?! – doctorate May 01 '15 at 15:42
  • Just out of curiosity, how did you view the object stream's contents? – doctorate May 01 '15 at 15:44
  • What about batch processing, assuming there is no such problem with the font? And which is better in this case, `.png` or `.jpg`, when using Ghostscript? – doctorate May 01 '15 at 15:47
  • @doctorate: *"using a different font caused the problem to disappear"*. I bet the different font has no (closing) parenthesis in its name then? – Kurt Pfeifle May 01 '15 at 15:55
  • True, it doesn't have these unusual parentheses. This leads me directly to the question of how to change the font name and remove that troublesome closing parenthesis. AFAIK, it is not simply a matter of renaming the file `ah-manal-black.ttf`. – doctorate May 01 '15 at 16:07
  • BTW, the link to this great tool `PeePDF.py` is broken. – doctorate May 01 '15 at 16:10
  • The file `ah-manal-black.ttf` also holds the font name internally, in its font description structure. To change the font name and remove the parentheses from it, you'd need to open, edit and save it again with a font editor such as *FontForge*. (Maybe the command-line tool `ttx` is already enough to change the font's internal name...) – Kurt Pfeifle May 01 '15 at 16:12
  • @doctorate: the link I gave (to PeePDF.py) works for me... The source is here: http://peepdf.googlecode.com/svn/trunk – Kurt Pfeifle May 01 '15 at 16:13
  • thanks for the hint, at least you showed me the way to freedom from all these strange names of Arabic fonts, but this way I will breach the copyright of their foundries I guess. – doctorate May 01 '15 at 16:14
  • @doctorate: You can use `ttx` to unpack the font and look at its internal copyright statement. But you should report a bug to the XeLaTeX authors -- but before that, please check if the problem is fixed in a newer version of it. – Kurt Pfeifle May 01 '15 at 16:50

I usually use pdftocairo to fix that:

pdftocairo corruptedinfile.pdf -pdf outfile.pdf

After that, ghostscript can handle it properly.
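
To cover the batch-processing part of the question, the two steps (repair with pdftocairo, then rasterize with ImageMagick's convert) could be chained over a folder roughly as in the Python sketch below. This is an outline under assumptions rather than a tested recipe: both tools are assumed to be on the PATH, `-density 72` is chosen to match the 72 dpi mentioned in the question, and the helper names, output file suffixes and folder names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(src_dir: str, out_dir: str, dpi: int = 72) -> list[list[str]]:
    """Build the repair and rasterize commands for every PDF in src_dir."""
    commands = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        repaired = Path(out_dir) / f"{pdf.stem}_repaired.pdf"
        final = Path(out_dir) / f"{pdf.stem}_image.pdf"
        # Step 1: re-write the PDF with pdftocairo (the command from this answer).
        commands.append(["pdftocairo", str(pdf), "-pdf", str(repaired)])
        # Step 2: rasterize the repaired file into an image-only PDF,
        # as the question does with "convert in.pdf out.pdf".
        commands.append(["convert", "-density", str(dpi), str(repaired), str(final)])
    return commands


def run_all(src_dir: str, out_dir: str) -> None:
    """Run every command in order, stopping at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(src_dir, out_dir):
        subprocess.run(cmd, check=True)
```

Calling `run_all("pdfs_in", "pdfs_out")` would process every PDF in the (hypothetical) folder `pdfs_in`; printing the result of `build_commands` first gives a dry run that shows the commands without executing them.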

Mathieu Rodic
RedRoosterMobile