
I have a pdf (created with latex with \usepackage[a-2b]{pdfx}) where I am able to correctly copy & paste ligatures, i.e., the "fi" ligature gets pasted into my text editor as "fi". The pdf is quite large, so I'm trying to reduce its size with this ghostscript command:

gs -dPDFA-2 -dBATCH -dNOPAUSE -sPDFACompatibilityPolicy=1 -sDEVICE=pdfwrite \
   -dPDFSETTINGS=/printer -sProcessColorModel=DeviceRGB \
   -sColorConversionStrategy=UseDeviceIndependentColor \
   -dColorImageDownsampleType=/Bicubic -dAutoRotatePages=/None \
   -dCompatibilityLevel=1.5 -dEmbedAllFonts=true -dFastWebView=true \
   -sOutputFile=main_new.pdf main.pdf

While this produces a nice, small pdf, now when I copy a word with "fi", I instead (often) get "ő".

Since the correct characters are somehow encoded in the original pdf, is there some parameter I can give ghostscript so that it simply preserves this information in the converted pdf?

I'm using ghostscript 9.27 on macOS 10.14.

cod3licious
1 Answer

Without seeing your original file, so that I can see the way the text is encoded, it's not possible to be definitive. It certainly is not possible to have the pdfwrite device 'preserve the information'; for an explanation, see here.

If your original PDF file has a ToUnicode CMap then the pdfwrite device should use that to generate a new ToUnicode CMap in the output file, maintaining cut & paste/search. If it doesn't, then the conversion process will destroy the encoding. You might be able to get an improvement in results by setting SubsetFonts to false, but that's just a guess without seeing an example.
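For instance, a sketch of your command with that extra switch added (whether it actually helps depends on the file, so treat it as something to try rather than a fix):

gs -dPDFA-2 -dBATCH -dNOPAUSE -sPDFACompatibilityPolicy=1 -sDEVICE=pdfwrite \
   -dSubsetFonts=false \
   -dPDFSETTINGS=/printer -sProcessColorModel=DeviceRGB \
   -sColorConversionStrategy=UseDeviceIndependentColor \
   -dColorImageDownsampleType=/Bicubic -dAutoRotatePages=/None \
   -dCompatibilityLevel=1.5 -dEmbedAllFonts=true -dFastWebView=true \
   -sOutputFile=main_new.pdf main.pdf

Note that embedding complete fonts instead of subsets will make the output somewhat larger.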

My guess is that your original file doesn't have a ToUnicode CMap, which means that it's essentially only working by luck.

KenS
  • Thanks for the explanation. How can I verify whether or not my pdf has the ToUnicode CMap? – cod3licious Jan 08 '20 at 15:00
  • Look inside it :-) You'll need to open it with an editor; if it uses compressed object streams, you'll need to decompress it first (see the sketch after these comments). Or more simply, post it somewhere public, put a link here, and we can look at it for you. – KenS Jan 08 '20 at 19:22
  • Ok, I didn't find "ToUnicode" anywhere in the pdf (even though I used the respective latex packages... -.-), so I suppose there is nothing ghostscript could have done. Weirdly enough, while searching for "fi" failed in the whole pdf (~100 pages), when I extracted only a single page to upload somewhere, "fi" was correctly recognized there again.... – cod3licious Jan 10 '20 at 12:39
  • It will depend on how the font is re-encoded, which is pretty much unpredictable as it depends on how it's used in the document. Changing the pages will change the way the font is used, and therefore the way it's re-encoded. – KenS Jan 10 '20 at 15:19
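As a side note, one way to check for a ToUnicode CMap without decompressing the file by hand in an editor is sketched below; it assumes qpdf is installed and uses a hypothetical output file name:

# rewrite the pdf with object streams expanded so a plain-text search works
qpdf --qdf --object-streams=disable main.pdf uncompressed.pdf
# count lines containing /ToUnicode, treating the file as text; 0 means none were found
grep -ac /ToUnicode uncompressed.pdf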