Snapshot testing PDFs

Question

The Problem

I'm trying to do some really quick snapshot testing of PDFs. Our system generates them using Spire.PDF and sends them to DocuSign encoded in base64. My snapshots take the generated base64 and compare it against "known good" to see if anything changed.

So far this process has been 100% reproducible on my computer and on a co-worker's computer. But when we run the tests on Jenkins, they fail. Looking deeper into the problem, the same test cases with the same base documents generate completely different bytes on Jenkins, while still producing identical-looking PDFs.

Some Details

Or things I've ruled out. Or other thoughts.

At some point in the process I have to convert to string and back, and I've checked that the encoding on my computer and Jenkins is the same.
I've replaced any "random" metadata values in the PDF, like the created date and ID, with static values so snapshots are possible.
I'm open to other ideas of snapshot testing that can be done quickly! I'm looking for a "quick win" to reduce the amount of regression testing we need to do every release on this stuff, and not a "comprehensive testing solution".
The build environment is destroyed every build and recreated, while my computer is not.
I haven't checked that the versions of Spire are the same (because it's a weekend and I don't have dev ops permissions and politics...), but both processes get the library through NuGet, so I don't see any reason why they would differ.
We're licensing Spire using the embedded resource method, and neither generated PDF implies it is a trial version.

*Identical looking pdfs* can easily consist of completely different bytes. If you're interested in *identical looks* only, you should simply render the pdf pages as bitmaps and compare these bitmaps. — mkl, Nov 23 '19 at 21:22
That's a good idea. I've been playing with comparing the text extracted from each PDF, but the fact that some parts of the PDFs might be images troubles me. Your plan reduces that trouble. — rythos42, Nov 23 '19 at 23:07
Both text comparison and image comparison still failed. I'm starting to think this is a "me problem", rather than a "SO problem". — rythos42, Nov 24 '19 at 04:40
Why does the text comparison fail? Is the extracted text really different, or is it just the kind or amount of white space? E.g. if part of a single line of text is actually moved down very slightly, most text extractors interpret that as multiple lines while you still perceive it as a single one. And the bitmap might also differ slightly. — mkl, Nov 24 '19 at 07:30
It was white space. Have thoughts on solving that? I felt like if I was going to build an ignore-white-space comparison, that this was barking up the wrong tree. I'm still not clear on how the same template with the same input can produce the same PDF on my computer every time, but fail on Jenkins, so my next plan was to somehow dig deeper into that. — rythos42, Nov 24 '19 at 16:14
I don't know your application. One possible cause would be dependency on locally installed fonts, another one different kerning rules due to different locales. But this obviously is pure guesswork. — mkl, Nov 24 '19 at 17:20
Your guesswork is much appreciated, it gives me a sense of the depth of the problem I'm trying to solve. — rythos42, Nov 24 '19 at 18:18
I also found this article -- https://stackoverflow.com/questions/21990255/itextsharp-comparing-2-pdfs-for-equality. Thanks very much for the discussion, I've decided to go with the ignore-space comparison and accept that it is sub-optimal. I'm going to mark this as a duplicate. Is there anything else I should do with it, as we don't have an "answer"? — rythos42, Nov 24 '19 at 22:53
On different platforms, the end-of-line (EOLN) sequence is often a difference for simple-text compare. Since the end of line is significant for comparing, you should probably replace all "\r\n" with "\n" then compare. You may also need to check character encoding since multi-byte chars can be a similar problem. Font is not relevant if you are extracting text. — Paul Jowett, Nov 25 '19 at 04:51
Rythos42, *"Is there anything else I should do with it, as we don't have an "answer"?"* - it sufficed. As you can see now, your question is visibly marked as a duplicate. — mkl, Nov 25 '19 at 05:59
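
The ignore-whitespace comparison settled on in the comments could look roughly like the sketch below. It is written in Python purely for illustration (the project itself is .NET), the helper names `normalize_text` and `same_text_ignoring_whitespace` are hypothetical, and it assumes the text has already been extracted from both PDFs by whatever tool is in use. Collapsing every run of whitespace to a single space also covers the "\r\n" versus "\n" difference mentioned above.

```python
import re


def normalize_text(text: str) -> str:
    """Collapse each run of whitespace (spaces, tabs, \\r\\n, \\n) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def same_text_ignoring_whitespace(expected: str, actual: str) -> bool:
    """Compare two extracted texts while ignoring the kind and amount of whitespace."""
    return normalize_text(expected) == normalize_text(actual)


if __name__ == "__main__":
    # Made-up sample strings standing in for text extracted on two machines.
    local = "Invoice 42\r\nTotal:  10.00\r\n"
    jenkins = "Invoice 42\nTotal:\n10.00\n"
    print(same_text_ignoring_whitespace(local, jenkins))
```

This is deliberately lossy: a regression that only changes spacing or line breaks would pass unnoticed, while a change that joins or splits words would still be caught.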
