I have two folders, each including ca. 100 PDF files resulting from different runs of the same PDF generation program. After performing some changes to this program, the resulting PDF should always stay equal and nothing should break the layout, the fonts, any potential graphs and so on. This is why I would like to check for visual equality while ignoring any metadata that might have changed due to running the program at different times.
My first approach was based on this post and attempted to compare the hashes of each file:
h1 = hashlib.sha1()
h2 = hashlib.sha1()
with open(fileName1, "rb") as file:
chunk = 0
while chunk != b'':
chunk = file.read(1024)
h1.update(chunk)
with open(fileName2, "rb") as file:
chunk = 0
while chunk != b'':
chunk = file.read(1024)
h2.update(chunk)
return (h1.hexdigest() == h2.hexdigest())
This always returns "False". I assume that this is due to different time dependent metadata, which is why I would like to ignore them. I've already found a way to set the modification and creation data to "None":
pdf1 = pdfrw.PdfReader(fileName1)
pdf1.Info.ModDate = pdf1.Info.CreationDate = None
pdfrw.PdfWriter().write(fileName1, pdf1)
pdf2 = pdfrw.PdfReader(fileName2)
pdf2.Info.ModDate = pdf2.Info.CreationDate = None
pdfrw.PdfWriter().write(fileName2, pdf2)
Looping through all files in each folder and running the second method before the first curiously sometimes results in a return value of "True" and sometimes in a return value of "False".
Thanks to the kind help of @jorj-mckie (see answer below), I've the following methods checking for xref equality:
doc1 = fitz.open(fileName1)
xrefs1 = doc1.xref_length() # cross reference table 1
doc2 = fitz.open(fileName2)
xrefs2 = doc2.xref_length() # cross reference table 2
if (xrefs1 != xrefs2):
print("Files are not equal")
return False
for xref in range(1, xrefs1): # loop over objects, index 0 must be skipped
# compare the PDF object definition sources
if (doc1.xref_object(xref) != doc2.xref_object(xref)):
print(f"Files differ at xref {xref}.")
return False
if doc1.xref_is_stream(xref): # compare binary streams
stream1 = doc1.xref_stream_raw(xref) # read binary stream
try:
stream2 = doc2.xref_stream_raw(xref) # read binary stream
except: # stream extraction doc2 did not work!
print(f"stream discrepancy at xref {xref}")
return False
if (stream1 != stream2):
print(f"stream discrepancy at xref {xref}")
return False
return True
and xref equality without metadata:
doc1 = fitz.open(fileName1)
xrefs1 = doc1.xref_length() # cross reference table 1
doc2 = fitz.open(fileName2)
xrefs2 = doc2.xref_length() # cross reference table 2
info1 = doc1.xref_get_key(-1, "Info") # extract the info object
info2 = doc2.xref_get_key(-1, "Info")
if (info1 != info2):
print("Unequal info objects")
return False
if (info1[0] == "xref"): # is there metadata at all?
info_xref1 = int(info1[1].split()[0]) # xref of info object doc1
info_xref2 = int(info2[1].split()[0]) # xref of info object doc1
else:
info_xref1 = 0
for xref in range(1, xrefs1): # loop over objects, index 0 must be skipped
# compare the PDF object definition sources
if (xref != info_xref1):
if (doc1.xref_object(xref) != doc2.xref_object(xref)):
print(f"Files differ at xref {xref}.")
return False
if doc1.xref_is_stream(xref): # compare binary streams
stream1 = doc1.xref_stream_raw(xref) # read binary stream
try:
stream2 = doc2.xref_stream_raw(xref) # read binary stream
except: # stream extraction doc2 did not work!
print(f"stream discrepancy at xref {xref}")
return False
if (stream1 != stream2):
print(f"stream discrepancy at xref {xref}")
return False
return True
If I run the last two functions on my PDF files, whose timestamps have already been set to "None" (see above), I end up with some equality checks resulting in a "True" return value and others resulting in "False".
I'm using the reportlab library to generate the PDFs. Do I just have to live with the fact that some PDFs will always have a different internal structure, resulting in different hashes even if the files look exactly the same? I would be very happy to learn that this is not the case and there is indeed a way to check for equality without actually having to export all pages to images first.