I am using qpdf to check whether Encoding and ToUnicode are properly set up for a PDF file: I run the following command and search the output file for the word 'ToUnicode'. The purpose is to make sure that ligatures in the file can be decoded properly by PDF viewers such as Adobe Acrobat Reader, pdf.js, pdfium, etc.

qpdf --stream-data=uncompress input.pdf output.txt

Is this the right way? What is recommended?

Martin Thoma
Jun
  • What do you mean by "properly set up"? And what by "look for ToUnicode"? – mkl Dec 21 '18 at 21:15
  • @mkl Properly encoded, I guess, so that when the file is opened in a PDF viewer, ligature text can be converted into the respective characters. By 'look for ToUnicode', I just meant searching for the word 'ToUnicode' in the qpdf-generated text file. – Jun Dec 21 '18 at 21:24
  • Well, that is not enough: each font in a PDF can have a different encoding, so each may require its own **ToUnicode** map. Furthermore, those maps may be incomplete or incorrect, so you have to check in a much more context-sensitive manner. – mkl Dec 21 '18 at 21:43
  • @mkl Thanks for the info. If I gave you a PDF and a specific line I am interested in, could you use it as an example to explain whether its ToUnicode map, if it has one, is complete and correct? I would like to learn how to check this. – Jun Dec 21 '18 at 21:51
  • Whether a **ToUnicode** map is complete cannot be judged based on the map alone. – mkl Dec 22 '18 at 07:38

1 Answer

This is quite a difficult task.

Your document can include multiple fonts, some with a ToUnicode CMap and some without, and all of them can be valid.
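As a rough first pass, you can at least list which font objects carry a ToUnicode entry instead of searching for the word anywhere in the file. The sketch below is only an illustration, not a full PDF parser: it assumes the PDF was first rewritten by qpdf into its uncompressed, editable form (for example with `qpdf --qdf --object-streams=disable input.pdf output.pdf`; note that the output is still a PDF, not a plain text file) and that font dictionaries appear as ordinary `N G obj ... endobj` blocks. The function name `list_fonts` is made up for this example. Descendant CIDFonts are skipped because, for composite fonts, the ToUnicode entry sits on the parent Type0 font.

```python
import re

# Matches "N G obj ... endobj" blocks in an uncompressed (QDF-style) PDF.
OBJ_RE = re.compile(r"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def list_fonts(pdf_text):
    """Return (object number, base font name, has ToUnicode) per font object.

    Hypothetical helper: a text scan, so it can be fooled by unusual files.
    """
    fonts = []
    for match in OBJ_RE.finditer(pdf_text):
        body = match.group(3)
        # \b keeps "/Type /FontDescriptor" from being counted as a font.
        if not re.search(r"/Type\s*/Font\b", body):
            continue
        subtype = re.search(r"/Subtype\s*/(\w+)", body)
        if subtype and subtype.group(1).startswith("CIDFontType"):
            continue  # descendant font; ToUnicode lives on the Type0 parent
        name = re.search(r"/BaseFont\s*/([^\s/<>\[\]()]+)", body)
        fonts.append((
            int(match.group(1)),
            name.group(1) if name else None,
            "/ToUnicode" in body,
        ))
    return fonts


if __name__ == "__main__":
    import sys
    # Read as latin-1 so that binary stream bytes do not raise decode errors.
    with open(sys.argv[1], encoding="latin-1") as handle:
        for number, name, has_map in list_fonts(handle.read()):
            print(number, name, "ToUnicode" if has_map else "no ToUnicode")
```

A font reported without ToUnicode is not necessarily broken, since a viewer may still be able to derive the text from a standard encoding, so treat the output as a list of fonts to look at more closely.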

Then, for the fonts that do include a ToUnicode CMap, you have to check that every character ID used with that font is also present in the ToUnicode CMap.
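The coverage check itself can be sketched as follows. This is a minimal illustration under stated assumptions: the ToUnicode CMap has already been extracted as uncompressed text, it uses only the usual `beginbfchar` and `beginbfrange` sections, and the set of character IDs actually used with the font has been collected separately from the page content streams (that part is the harder one and is not shown). The names `mapped_codes` and `unmapped_codes` are made up for this example.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"


def mapped_codes(cmap_text):
    """Collect every source character code that the CMap text maps."""
    codes = set()
    # bfchar sections: one "<src> <dst>" pair per entry.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.DOTALL):
        for src, _dst in re.findall(HEX + r"\s*" + HEX, block):
            codes.add(int(src, 16))
    # bfrange sections: "<lo> <hi> <dst>" or "<lo> <hi> [<dst> <dst> ...]".
    entry = HEX + r"\s*" + HEX + r"\s*(?:<[0-9A-Fa-f]+>|\[[^\]]*\])"
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.DOTALL):
        for low, high in re.findall(entry, block):
            codes.update(range(int(low, 16), int(high, 16) + 1))
    return codes


def unmapped_codes(used_codes, cmap_text):
    """Return the used character codes that have no ToUnicode entry, sorted."""
    return sorted(set(used_codes) - mapped_codes(cmap_text))
```

An empty result only means the map is complete for the codes you passed in; it says nothing about whether the mapped values are right.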

The last step is to check that each character ID is mapped to the right character (or characters, for a ligature). This cannot be done automatically, because you don't know which character a given ID represents. For example, suppose character ID 1 is drawn as the glyph 'A' when text is displayed on the page, but the ToUnicode CMap maps character ID 1 to the character 'B'. This is a logical error that cannot be detected automatically.
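What a script can do is prepare the mapping for a human to compare against the rendered page. The sketch below is an assumption-laden illustration: it reads only `beginbfchar` sections from an already extracted, uncompressed ToUnicode CMap, assumes the targets are well-formed UTF-16BE hex strings, and marks entries that look like ligatures (either several characters, or one of the Unicode presentation forms U+FB00 to U+FB06). The names `bfchar_mappings` and `review_rows` are made up for this example.

```python
import re

PAIR_RE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def bfchar_mappings(cmap_text):
    """Map each character code in the bfchar sections to its Unicode text."""
    mappings = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.DOTALL):
        for src, dst in PAIR_RE.findall(block):
            # ToUnicode targets are UTF-16BE code units written as hex.
            mappings[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mappings


def review_rows(mappings):
    """Return (code, text, looks_like_ligature) rows for manual review."""
    rows = []
    for code, text in sorted(mappings.items()):
        presentation_form = len(text) == 1 and 0xFB00 <= ord(text) <= 0xFB06
        rows.append((code, text, len(text) > 1 or presentation_form))
    return rows
```

The script cannot tell that code 1 should have been 'A' rather than 'B'; it only gives you a table to check by eye against what the viewer displays.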

Mihai Iancu