Comparing PDF files with varying degrees of strictness

Question

I have two folders, each including ca. 100 PDF files resulting from different runs of the same PDF generation program. After performing some changes to this program, the resulting PDF should always stay equal and nothing should break the layout, the fonts, any potential graphs and so on. This is why I would like to check for visual equality while ignoring any metadata that might have changed due to running the program at different times.

My first approach was based on this post and attempted to compare the hashes of each file:

h1 = hashlib.sha1()
h2 = hashlib.sha1()

with open(fileName1, "rb") as file:
    chunk = 0
    while chunk != b'':
        chunk = file.read(1024)
        h1.update(chunk)

with open(fileName2, "rb") as file:
    chunk = 0
    while chunk != b'':
        chunk = file.read(1024)
        h2.update(chunk)

return (h1.hexdigest() == h2.hexdigest())

This always returns "False". I assume that this is due to different time-dependent metadata, which is why I would like to ignore it. I've already found a way to set the modification and creation dates to "None":

pdf1 = pdfrw.PdfReader(fileName1)
pdf1.Info.ModDate = pdf1.Info.CreationDate = None
pdfrw.PdfWriter().write(fileName1, pdf1)
    
pdf2 = pdfrw.PdfReader(fileName2)
pdf2.Info.ModDate = pdf2.Info.CreationDate = None
pdfrw.PdfWriter().write(fileName2, pdf2)

Looping through all files in each folder and running the second method before the first curiously sometimes results in a return value of "True" and sometimes in a return value of "False".

Thanks to the kind help of @jorj-mckie (see answer below), I now have the following methods. The first checks for xref equality:

doc1 = fitz.open(fileName1)
xrefs1 = doc1.xref_length() # cross reference table 1
doc2 = fitz.open(fileName2)
xrefs2 = doc2.xref_length() # cross reference table 2
    
if (xrefs1 != xrefs2):
    print("Files are not equal")
    return False
    
for xref in range(1, xrefs1):  # loop over objects, index 0 must be skipped
    # compare the PDF object definition sources
    if (doc1.xref_object(xref) != doc2.xref_object(xref)):
        print(f"Files differ at xref {xref}.")
        return False
    if doc1.xref_is_stream(xref):  # compare binary streams
        stream1 = doc1.xref_stream_raw(xref)  # read binary stream
        try:
            stream2 = doc2.xref_stream_raw(xref)  # read binary stream
        except:  # stream extraction doc2 did not work!
            print(f"stream discrepancy at xref {xref}")
            return False
        if (stream1 != stream2):
            print(f"stream discrepancy at xref {xref}")
            return False
return True

and xref equality without metadata:

doc1 = fitz.open(fileName1)
xrefs1 = doc1.xref_length() # cross reference table 1
doc2 = fitz.open(fileName2)
xrefs2 = doc2.xref_length() # cross reference table 2
    
info1 = doc1.xref_get_key(-1, "Info")  # extract the info object
info2 = doc2.xref_get_key(-1, "Info")
    
if (info1 != info2):
    print("Unequal info objects")
    return False
    
if (info1[0] == "xref"): # is there metadata at all?
    info_xref1 = int(info1[1].split()[0])  # xref of info object doc1
    info_xref2 = int(info2[1].split()[0])  # xref of info object doc2

else:
    info_xref1 = 0
            
for xref in range(1, xrefs1):  # loop over objects, index 0 must be skipped
    # compare the PDF object definition sources
    if (xref != info_xref1):
        if (doc1.xref_object(xref) != doc2.xref_object(xref)):
            print(f"Files differ at xref {xref}.")
            return False
        if doc1.xref_is_stream(xref):  # compare binary streams
            stream1 = doc1.xref_stream_raw(xref)  # read binary stream
            try:
                stream2 = doc2.xref_stream_raw(xref)  # read binary stream
            except:  # stream extraction doc2 did not work!
                print(f"stream discrepancy at xref {xref}")
                return False
            if (stream1 != stream2):
                print(f"stream discrepancy at xref {xref}")
                return False
return True

If I run the last two functions on my PDF files, whose timestamps have already been set to "None" (see above), I end up with some equality checks resulting in a "True" return value and others resulting in "False".

I'm using the reportlab library to generate the PDFs. Do I just have to live with the fact that some PDFs will always have a different internal structure, resulting in different hashes even if the files look exactly the same? I would be very happy to learn that this is not the case and there is indeed a way to check for equality without actually having to export all pages to images first.

score 1 · Answer 1 · answered Jan 13 '23 at 14:52

I think you should use PyMuPDF for PDF handling - it has all batteries included for your task (and many more!).

First thing to clarify:

What type of equality are you looking for? Requiring only that the number of pages is equal and that pages look the same pairwise is very different from requiring that all objects and streams are identical with the exception of the PDF /ID.

Both comparison types are possible with PyMuPDF. To do the latter comparison, loop through both object number tables and compare them pairwise:

import sys
import fitz  # import package PyMuPDF
doc1 = fitz.open("file1.pdf")
xrefs1 = doc1.xref_length()  # cross reference table 1
doc2 = fitz.open("file2.pdf")
xrefs2 = doc2.xref_length()  # cross reference table 2
if xrefs1 != xrefs2:
    sys.exit("Files are not equal")  # quick exit
for xref in range(1, xrefs1):  # loop over objects, index 0 must be skipped
    # compare the PDF object definition sources
    if doc1.xref_object(xref) != doc2.xref_object(xref):
        sys.exit(f"Files differ at xref {xref}.")
    if doc1.xref_is_stream(xref):  # compare binary streams
        stream1 = doc1.xref_stream_raw(xref)  # read binary stream
        try:
            stream2 = doc2.xref_stream_raw(xref)  # read binary stream
        except:  # stream extraction doc2 did not work!
            sys.exit(f"stream discrepancy at xref {xref}")
        if stream1 != stream2:
            sys.exit(f"stream discrepancy at xref {xref}")
print("Files are equal!")  # normal exit (status 0) when no difference was found

This still is a rather strict equality check: For example, if any date or time in the document metadata has changed, you would report inequality even if the rest is equal.

But there is help: Determine the xref of the metadata and exclude it from the above loop:

info1 = doc1.xref_get_key(-1, "Info")  # extract the info object
info2 = doc2.xref_get_key(-1, "Info")
if info1 != info2:
    sys.exit("Unequal info objects")
if info1[0] == "xref":  # is there metadata at all?
    info_xref1 = int(info1[1].split()[0])  # xref of info object doc1
    info_xref2 = int(info2[1].split()[0])  # xref of info object doc2
    # make another equality check here
    # in above loop skip if xref == info_xref1.
else:
    info_xref1 = 0  # 0 is never an xref number, so can safely be used in loop
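
The looser comparison mentioned at the start (same number of pages, pages look the same pairwise) is not shown above. A minimal sketch, under some assumptions: each page is rendered to an in-memory pixmap and the raw pixel bytes are compared, so no image files are written. The names `render_pages`, `first_visual_difference` and `look_equal` and the 72 dpi default are made up for illustration, and `get_pixmap(dpi=...)` assumes a reasonably recent PyMuPDF version. Any single differing pixel counts as a difference.

```python
from itertools import zip_longest


def render_pages(path, dpi=72):
    """Yield (width, height, pixel bytes) for each page, rendered in memory."""
    import fitz  # PyMuPDF; imported here so the comparison below works without it

    with fitz.open(path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)
            yield pix.width, pix.height, pix.samples


def first_visual_difference(pages1, pages2):
    """Return the 1-based number of the first differing page, or None if all match."""
    for number, (p1, p2) in enumerate(zip_longest(pages1, pages2), start=1):
        if p1 != p2:  # also catches a missing page (None) in the shorter document
            return number
    return None


def look_equal(path1, path2, dpi=72):
    """True if both files have the same page count and identical page renderings."""
    return first_visual_difference(render_pages(path1, dpi), render_pages(path2, dpi)) is None
```

Keeping the rendering separate from the comparison means the comparison itself only sees tuples, so a different renderer or a tolerance-based pixel check could be swapped in later. Metadata such as timestamps plays no role here because only the rendered pixels are compared.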

Thank you very much for your answer. Unfortunately, it didn't work out for me because "info_xref1" only returns integers in my case, while I'm looking for a way to extract the timestamps. — Hagbard, Jan 16 '23 at 10:39
@Hagbard you said want to ignore timestamps? So what do you still need them for? My suggestion shows how to **ignore** any metadata info (where at least some of the PDF timestamps are). Of course `info_xref1` is an integer! That is the intention. In the same way you can also exclude any XML metadata comparison if that is what you want. — Jorj McKie, Jan 17 '23 at 11:40
I apologize for being unclear. I've just edited my initial question again and provided some further details to (hopefully) clarify my issue. — Hagbard, Jan 17 '23 at 12:55

K J · Answer 2 · 2023-01-17T22:25:24.400

Command-line and GUI PDF diff tools have been around a long time and many are available. This cross-platform one (https://github.com/vslavik/diff-pdf) works both as a CLI and as a GUI executable, so you get the best of both worlds.

By default, its only output is its return code, which is 0 if there are no differences and 1 if the two PDFs differ. If given the --output-diff option, it produces a PDF file with visually highlighted differences.
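
Since the return code is the only output by default, a script can use it directly. The following is a rough sketch, not part of diff-pdf itself: the function name `pdfs_match` and the `command` parameter are hypothetical, and treating any exit code other than 0 or 1 as an error is an assumption.

```python
import subprocess


def pdfs_match(path_a, path_b, command=("diff-pdf",)):
    """Run the comparison tool on two files and interpret its exit code."""
    result = subprocess.run([*command, str(path_a), str(path_b)])
    if result.returncode == 0:  # no differences
        return True
    if result.returncode == 1:  # the two PDFs differ
        return False
    # Assumption: anything else means the tool itself failed.
    raise RuntimeError(f"comparison tool failed with exit code {result.returncode}")
```

The `command` parameter only exists so that another executable or extra options can be substituted without changing the function.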

Other tools built more specifically for cross-platform Python tend to treat text and graphical differences separately, so for text you could try https://github.com/JoshData/pdf-diff, and for graphical comparison there is https://github.com/bgeron/diff-pdf-visually

So, taking the dual-purpose diff-pdf above as an example, you can quickly go through a folder, comparing the files blind in pairs to collect the true/false report, and then do a final one-by-one visual comparison by shelling out to:

diff-pdf --view a.pdf b.pdf

Note that this is version 0.4, but 0.5 is available.

Sadly, if all 100 files look similar on a simple comparison, they all need testing against each other, so you need a batch file that runs approximately 4,950 (99 x 100 / 2) fast binary tests:

test 1.pdf 2.pdf report  
test 1.pdf 3.pdf report  
...  
test 1.pdf 100.pdf report  
test 2.pdf 3.pdf report  
test 2.pdf 4.pdf report
...
test 98.pdf 99.pdf report
test 98.pdf 100.pdf report
test 99.pdf 100.pdf report

Then filter out the ones that match and visually inspect the much smaller number remaining that were reported as not matching.
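
The list of tests above can also be generated rather than written by hand. This is a minimal sketch assuming a byte-for-byte comparison as the fast binary test; `same_bytes` and `pairwise_report` are hypothetical names, and the file names 1.pdf to 100.pdf simply mirror the list above.

```python
import filecmp
from itertools import combinations


def same_bytes(path_a, path_b):
    """Fast binary test: True only if both files have identical contents."""
    return filecmp.cmp(path_a, path_b, shallow=False)


def pairwise_report(paths, same=same_bytes):
    """Map every unordered pair of paths to True (match) or False (differ)."""
    return {(a, b): same(a, b) for a, b in combinations(paths, 2)}


names = [f"{n}.pdf" for n in range(1, 101)]
print(len(list(combinations(names, 2))))  # 4950 pairs, i.e. 100 * 99 / 2
```

Pairs mapped to False in the report are the ones left over for visual inspection.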

So if 49 = 30 = 1 and 60 = 45 = 25 = 2, but no others match, then only 1 and 2 need a closer look. Of course there will likely be more, and you can get a second opinion on those too.
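
The grouping idea can be sketched as follows, assuming the comparison is transitive (if A matches B and B matches C, then A matches C). Each file is compared only against one representative per group, which usually needs far fewer tests than all pairs. The names `group_equal` and `same`, and the labels in the example, are hypothetical and only mimic the numbers above.

```python
def group_equal(items, same):
    """Split items into groups of mutually matching members.

    `same(a, b)` is any comparison returning True or False; each new item
    is tested against the first member of every existing group only.
    """
    groups = []
    for item in items:
        for group in groups:
            if same(group[0], item):
                group.append(item)
                break
        else:  # no existing group matched, so this item starts a new one
            groups.append([item])
    return groups


# Stand-in for a real file comparison: files with the same label "match".
labels = {1: "a", 30: "a", 49: "a", 2: "b", 25: "b", 45: "b", 60: "b"}
groups = group_equal(sorted(labels), lambda x, y: labels[x] == labels[y])
print(groups)  # [[1, 30, 49], [2, 25, 45, 60]]
```

Only the first member of each group (here 1 and 2) then needs a closer look.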

If you know which page is likely to change, you can test images of that page exclusively, say the 3rd page, which has a date or other identifying feature.

Thank you very much for this very detailed answer. It helped a lot to steer me in the right direction. — Hagbard, Jan 19 '23 at 11:42
