
While rendering a PDF file generated by PDFCreator 0.9.x, I noticed that it contains an error in the character mapping. An error in a PDF file is nothing unusual in itself: Acrobat is very good at rendering faulty PDF files, so a lot of PDF generators create PDFs that do not fully adhere to the PDF standard.

I tried to create a small example file: http://test.continuit.nl/temp/Document.pdf

The single page renders a single glyph (a capital A) using a Tj command (see stream 5 0 obj). The selected font (7 0 obj) is an embedded subset font containing a single glyph. So far so good. The glyph is referenced by character code 1. The Encoding of the font contains a Differences array: [ 1 /A ]. Thus character code 1 -> glyph name /A. However, the cmap in the embedded subset font maps no glyph at character 65 (i.e. capital A). Instead, the cmap defines its characters in exactly the same order as the Font -> Encoding -> Differences array in the PDF file.
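To make the Differences step concrete, here is a minimal Python sketch of how such an array expands into a code-to-name map. It assumes the array has already been parsed into a list of integers and name strings; the function name `expand_differences` is hypothetical and not part of any library.

```python
def expand_differences(differences):
    """Expand a parsed /Differences array into {character code: glyph name}.

    An integer sets the current code; each name that follows is assigned
    to the current code, which then increases by one.
    """
    mapping = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("Differences array must start with a code")
            mapping[code] = item
            code += 1
    return mapping


# The array from the example file: [ 1 /A ]
print(expand_differences([1, "A"]))
```

For the example file this yields a single entry that maps code 1 to the name A.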

It looks like the character mapping / encoding is done twice. Only files from PDFCreator 0.9.x seem to be affected.

My question is: is this correct (or did I make a mistake and is the PDF actually correct), and what would you do to detect this situation in order to solve the rendering problem?

Note: I do need to be able to render these PDFs.

Solution

The ISO 32000 document contains a remark that for symbolic TrueType fonts (flag bit 3 is set in the font descriptor) an Encoding is not allowed and you should ignore it, always using a simple one-to-one encoding. So, all in all, if it is a symbolic font, I ignore the Encoding object altogether, and this solves the problem.
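A minimal Python sketch of that rule, assuming the font descriptor /Flags value, the expanded Differences map, a glyph-name-to-Unicode table and the font's cmap have already been parsed into plain integers and dictionaries. All names here (`is_symbolic`, `glyph_id`, the parameters) are hypothetical, and the non-symbolic branch is simplified to a single name -> Unicode -> cmap lookup.

```python
# Flag bits are counted from 1 in the PDF spec, so bit 3 has the value 4.
SYMBOLIC_FLAG = 1 << 2


def is_symbolic(flags):
    """True if the Symbolic bit is set in the font descriptor /Flags."""
    return bool(flags & SYMBOLIC_FLAG)


def glyph_id(code, flags, differences, name_to_unicode, cmap):
    """Resolve a character code to a glyph ID (0 means .notdef)."""
    if is_symbolic(flags):
        # Symbolic font: ignore the Encoding, use the code directly.
        return cmap.get(code, 0)
    # Non-symbolic font (simplified): code -> name -> Unicode -> cmap.
    name = differences.get(code)
    unicode_value = name_to_unicode.get(name)
    return cmap.get(unicode_value, 0)


# Values modelled on the example file: code 1 is named /A, but the
# embedded cmap only has an entry for code 1, not for 65.
differences = {1: "A"}
name_to_unicode = {"A": 65}
cmap = {1: 36}

print(glyph_id(1, 4, differences, name_to_unicode, cmap))   # symbolic
print(glyph_id(1, 32, differences, name_to_unicode, cmap))  # non-symbolic
```

With the Symbolic bit set the glyph is found; going through the Encoding instead ends at code 65, which the subset cmap does not define.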

Ritsaert Hornstra

1 Answer


The first point is that the file opens and renders correctly in Acrobat, so it's almost certain that the file is correct. In fact, it opens and renders correctly in a wide range of PDF consumers, which makes it very likely that it is correct.

The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.

In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.

You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:

"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."

I believe that in your case you simply need to use the character code (0x01) directly in the cmap subtable. This will give you a GID of 36.
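As an illustration of such a direct lookup, here is a minimal Python sketch for a format 0 (byte encoding) cmap subtable, which is a 6-byte header followed by 256 one-byte glyph IDs. Whether the font in question uses format 0 is an assumption; the subtable below is synthetic, built only to mirror the code 0x01 -> GID 36 case, and `lookup_format0` is a hypothetical helper name.

```python
import struct


def lookup_format0(subtable, code):
    """Look up a character code in a format 0 cmap subtable.

    Layout: uint16 format, uint16 length, uint16 language,
    then 256 one-byte glyph IDs indexed by character code.
    """
    fmt, length, language = struct.unpack(">HHH", subtable[:6])
    if fmt != 0:
        raise ValueError("this sketch only handles format 0 subtables")
    if not 0 <= code <= 255:
        return 0  # outside the table: .notdef
    return subtable[6 + code]


# Synthetic subtable (not read from the real file): only code 0x01 is mapped.
glyph_ids = bytearray(256)
glyph_ids[0x01] = 36
subtable = struct.pack(">HHH", 0, 6 + 256, 0) + bytes(glyph_ids)

print(lookup_format0(subtable, 0x01))
print(lookup_format0(subtable, 65))
```

Other subtable formats (for example format 4 or 6) need their own decoding; the point here is only that the raw character code is the lookup key.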

KenS
  • Concerning your first point: after working on a PDF renderer for a year now, I can tell you that Acrobat is exceptionally good at rendering faulty PDFs, mostly ones with missing keys in PDF objects. You see these problems especially in font handling. – Ritsaert Hornstra Aug 22 '11 at 09:07
  • What I see is the same minus the last paragraph: the content stream contains the command to draw char 0x01. Given the font Encoding, this is PostScript char /A. The TT font does not have a filled-in post table. I use the TT cmap like in any other PDF with TT fonts (not Type 0), and I need to look up character 65, given the Acrobat translation table from names to Unicode codes. Most PDFs work correctly like that; only the ones created with PDFCreator (as far as I can see now) don't need this. Note that both the Mac and MS cmap in the TT font use the same mapping. Others really need the Encoding in the PDF. – Ritsaert Hornstra Aug 22 '11 at 09:14
  • Note that on page 431 of the PDF 1.7 standard it states that you should first use the Encoding dictionary and the Differences there, and then, if a (3, 1) "cmap" subtable (Microsoft Unicode) is present, first map to the PS name, then to the Unicode value, and then use the (3, 1) subtable. This is exactly the way the renderer works now, and it yields the wrong result (in the case of the PDFCreator 0.9.x files). – Ritsaert Hornstra Aug 22 '11 at 13:45
  • Replying to comment one: I did say that I had tried several other PDF consumers as well as Acrobat, and they too can handle this file correctly. I am not relying on Acrobat. FWIW I've been working on rendering PDF files since version 1.0 of the spec, about 15 years now. – KenS Aug 24 '11 at 16:27
  • It is not allowed to have an Encoding object in the font for a symbolic font (bit 3 set in the descriptor). I tried to always use the Encoding if present, but ignoring it in this case did the trick. The ISO 32000 document is much clearer about this than the PDF 1.7 document from Adobe. This is not exactly the same as your answer, but since it is effectively correct I will accept your answer. – Ritsaert Hornstra Jan 08 '12 at 16:26
  • The PDF 1.7 spec *recommends* you don't have an Encoding with a symbolic TrueType font; the PDF/A spec says you *must not* have an Encoding. I don't have a copy of the ISO spec to hand, though I thought it was the same as the 1.7 spec. I guess not. In any event, it's not a good idea, since it does confuse some consumers. I'm glad you found a solution. – KenS Jan 09 '12 at 16:50
  • Correct. The main problem was that this is what PDFCreator generates by default for all embedded TT fonts. I started to scan the whole ISO document, and here and there it is much clearer about how to handle these kinds of cases. – Ritsaert Hornstra Jan 09 '12 at 20:06