I have an automated process that generates PDFs, which we then compare to a known version via approval tests to verify that nothing in the pipeline is broken. I normalize mismatching fields such as created/modified date and timezone, and locally everything always matches 100%. However, for some reason the PDFs generated on our build server are very different from those I generate locally; the local ones are sometimes as much as 20% larger.

The first difference when comparing the files in WinMerge is the /FontName field, which looks like this:

Locally Generated

/FontName/QOAAAA+TimesNewRomanRegular

Build Server Generated

/FontName/QYAAAA+TimesNewRomanRegular

After that there are differences in /FontBBox, length, and binary data. I see several blocks like this.

My suspicion is that slightly different fonts are available on the two machines, and that a different one is being selected and embedded into the PDF on each. But I have no idea what the Q*AAAA code above means, nor how to verify that hypothesis.

Edit:

pdffonts reports identical fonts in both, but couldn't they just be different versions of the same embedded font?

W:\xpdfbin-win-3.03\bin64> .\pdffonts.exe w:\...\PhantomRasterizer\Can_rasterize_html_to_pdf.slide_with_table_and_svg.approved.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
TimesNewRomanRegular                 CID TrueType      yes no  yes      7  0
ArialBold                            CID TrueType      yes no  yes     12  0
ArialRegular                         CID TrueType      yes no  yes     17  0
W:\xpdfbin-win-3.03\bin64> .\pdffonts.exe W:\...\PhantomRasterizer\Can_rasterize_html_to_pdf.slide_with_table_and_svg.received.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
TimesNewRomanRegular                 CID TrueType      yes no  yes      7  0
ArialBold                            CID TrueType      yes no  yes     12  0
ArialRegular                         CID TrueType      yes no  yes     17  0
George Mauer
  • Can you compare the exact fonts that are being used in each location? It could be something like a wider UTF character set. It could be something as simple as a different version of the same font. – David Woods May 27 '14 at 20:02
  • @DavidWoods How do I find out what fonts are embedded into the pdf? – George Mauer May 27 '14 at 20:03
  • @GeorgeMauer http://stackoverflow.com/questions/614619/how-to-find-out-which-fonts-are-referenced-and-which-are-embedded-in-a-pdf-docum – admdrew May 27 '14 at 20:05
  • You can try using @font-face CSS to force the use of a specific, locally provided font. That could eliminate binary font differences as the culprit. – David Woods May 27 '14 at 20:12
  • For context, David and I have discussed that I'm generating these pdfs via phantomjs and postprocessing them with pdfsharp – George Mauer May 27 '14 at 20:14
  • @admdrew thanks for the tip. `pdffonts` reports the same fonts (see edit). Could they be different versions of the same font though? – George Mauer May 27 '14 at 20:20
  • I found out the six characters at the beginning are randomly generated, and their presence indicates that the embedding is of a subset of the font, rather than the whole. Are these definition lines different themselves? The character set should be included in the definition here. – David Woods May 27 '14 at 20:25
  • The PDF 1.7 specs say this of FontBBox: `(Required, except for Type 3 fonts) A rectangle (see 7.9.5, "Rectangles"), expressed in the glyph coordinate system, that shall specify the font bounding box. This should be the smallest rectangle enclosing the shape that would result if all of the glyphs of the font were placed with their origins coincident and then filled.` The fact that you're getting a different FontBBox could indicate different character subsets or different binary data the fonts are built from. – David Woods May 27 '14 at 20:28
  • If you can supply example documents, it should be easier to figure out whether the fonts are the same or not. – David van Driessche May 28 '14 at 11:57
  • Also, does your document contain a time stamp or any other reference to the machine it was generated on on the actual pages in the document? (for example, is there an automatically generated footer that has the date/time at the bottom)? – David van Driessche May 28 '14 at 11:58
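
The subset-tag point raised in the comments above can be checked by pulling the /FontName entries out of both files and comparing them with the six-letter tag stripped. A minimal sketch, assuming the font descriptors are stored uncompressed (as they appear to be in the WinMerge diff in the question); the helper name `font_names` and the two sample byte strings are hypothetical and only mirror the lines quoted in the question:

```python
import re

# Matches e.g. "/FontName/QOAAAA+TimesNewRomanRegular", with optional
# whitespace after the key. Assumption: the entry is not inside a
# compressed object stream, otherwise it will not be found at all.
FONT_NAME = re.compile(rb"/FontName\s*/(?:([A-Z]{6})\+)?([^\s/<>\[\]()%]+)")


def font_names(pdf_bytes):
    """Return (subset_tag, base_name) pairs; subset_tag is None if absent."""
    pairs = []
    for tag, base in FONT_NAME.findall(pdf_bytes):
        pairs.append((tag.decode("ascii") or None, base.decode("latin-1")))
    return pairs


# Hypothetical stand-ins for the two generated files.
local = b"<< /Type/FontDescriptor /FontName/QOAAAA+TimesNewRomanRegular >>"
server = b"<< /Type/FontDescriptor /FontName/QYAAAA+TimesNewRomanRegular >>"

print(font_names(local))
print(font_names(server))

# Compare base names only, ignoring the arbitrary subset tag.
same_base = [b for _, b in font_names(local)] == [b for _, b in font_names(server)]
print(same_base)
```

Matching base names only show that the same font name was chosen on both machines. They say nothing about whether the embedded font programs are byte-identical, so this narrows the question rather than answering it.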

1 Answer

Please read my answer to this question: Why are PDF files different even if the content is the same?

Your question is the equivalent of "Why is the order of entries in a HashMap different on different JVMs?" The answer is simple: because HashMaps are designed that way. A HashMap is not a TreeMap.

You are now focusing on fonts, more specifically font subsets. Regarding the random characters in the name of the font subset, ISO-32000-1 states "the choice of letters is arbitrary", so you're contesting the ISO standard in your question. However, this is the least of your troubles. The IDs of a PDF should be different too, and the order of entries in a dictionary is like the order of entries in a HashMap. Read section 7.3.7 of ISO-32000-1:

The entries in a dictionary represent an associative table and as such shall be unordered even though an arbitrary order may be imposed upon them when written in a file. That ordering shall be ignored.
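
To illustrate what "that ordering shall be ignored" means for a comparison, here is a rough sketch in Python. It handles only flat dictionaries whose values are names, numbers, or indirect references; nested dictionaries, arrays, and strings are out of scope. The helper `parse_flat_dict` and the two sample dictionaries are hypothetical, not taken from any library or from the files in the question:

```python
import re

# One "/Key value" pair, where value is a name (/Font), an indirect
# reference (7 0 R), or a number. Anything else is skipped by this sketch.
ENTRY = re.compile(rb"/(\w+)\s*(/\w+|\d+\s+\d+\s+R|-?\d+(?:\.\d+)?)")


def parse_flat_dict(raw):
    """Parse a flat '<< ... >>' PDF dictionary into a Python dict."""
    body = raw.strip()
    if not (body.startswith(b"<<") and body.endswith(b">>")):
        raise ValueError("not a PDF dictionary")
    return {
        key.decode(): b" ".join(value.split()).decode()
        for key, value in ENTRY.findall(body[2:-2])
    }


# Same entries, written in a different order.
first = b"<< /Type /Font /Subtype /TrueType /FontDescriptor 7 0 R >>"
second = b"<< /FontDescriptor 7 0 R /Subtype /TrueType /Type /Font >>"

print(first == second)                                    # False
print(parse_flat_dict(first) == parse_flat_dict(second))  # True
```

The byte comparison fails while the key/value comparison succeeds, which is why a plain text diff of two PDFs can report differences that carry no meaning.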

The same goes for object numbers. I've seen tests that check whether the object with object number 1 is this or that dictionary, and the object with object number 2 is this or that array. However, object numbers don't matter. You can create a PDF document on one system where the first object is a dictionary and the second one an array, and create the same PDF document using the same code on another system where it's the other way around. We recently noticed that one of our tests was bad when testing our software with Java 8 instead of Java 7. You can have the same problem with your tests as soon as you change the JVM.

Your validation is wrong. When we test PDFs, we use a completely different approach.

Bruno Lowagie
  • I agree with your statement in the general sense, but not in the framework of the OP's question. If the same automated process is used in both environments, it's reasonable to expect that the files could be compared at a much lower level than would generally work; that's also what the detail provided by the OP suggests. – David van Driessche May 28 '14 at 12:00
  • Thanks so much for your help. As David states, these files are being generated by *the same* automated process, just on different machines. I'm aware of the ID, dates, and timezones [and normalize for these](https://github.com/togakangaroo/ApprovalTests.BetterPdfVerification/blob/master/ApprovalTests.BetterPdfVerification/PdfApprovals.cs). This worked perfectly until we added developers to the team and started running tests in other locations. Now it looks like I need to normalize for ... Sounds like you might have a better suggestion for how to verify that two PDFs are visually identical? – George Mauer May 28 '14 at 15:01
  • We use a reference PDF and create an image from that PDF. We then generate a new PDF and create an image for that new PDF. Finally, we compare both images on a pixel-by-pixel basis. This way it doesn't matter if the PDF syntax draws a line from left to right or from right to left: as long as the line is correct, the test passes (which is what we want). We also crawl through the COS model to check things such as annotations. – Bruno Lowagie May 28 '14 at 15:26
  • Man...I was hoping you wouldn't say that. Oh well, convert-to-tiff it is :) – George Mauer May 28 '14 at 15:33
  • Different IDs, hash, ordering, etc. I understand. But the size of all of these should be the same. What is a six character ID on one system is a six character ID on another system. Hashes, by definition, produce a fixed-length output. So these differences don't explain a 20% difference in file size. It also wouldn't explain a different FontBBox. – David Woods May 28 '14 at 15:37
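
The image-based comparison described in the comments above could look roughly like the following. This is only a sketch of the comparison step: it assumes each page has already been rasterized by some external tool into rows of (r, g, b) tuples, and the function name `differing_pixels` and its `tolerance` parameter are hypothetical additions, not something the commenters described:

```python
def differing_pixels(reference, candidate, tolerance=0):
    """Return (x, y) coordinates where two rasterized pages differ.

    Each raster is a list of rows; each row is a list of (r, g, b) tuples.
    A pixel counts as different if any channel differs by more than
    `tolerance`.
    """
    if len(reference) != len(candidate) or any(
        len(ref_row) != len(cand_row)
        for ref_row, cand_row in zip(reference, candidate)
    ):
        raise ValueError("page images have different dimensions")

    diffs = []
    for y, (ref_row, cand_row) in enumerate(zip(reference, candidate)):
        for x, (p, q) in enumerate(zip(ref_row, cand_row)):
            if any(abs(a - b) > tolerance for a, b in zip(p, q)):
                diffs.append((x, y))
    return diffs


# Two tiny 2x2 "pages": identical except for one slightly darker pixel.
white = (255, 255, 255)
approved = [[white, white], [white, white]]
received = [[white, (250, 250, 250)], [white, white]]

print(differing_pixels(approved, received))               # [(1, 0)]
print(differing_pixels(approved, received, tolerance=8))  # []
```

A small tolerance may help if anti-aliasing differs between machines, but whether that is acceptable depends on what the test is meant to catch.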