PDF File dedupe issue with same content, but generated at different time periods from a docx

Question

I am working on a PDF deduplication project and have looked at many Python libraries that read each file, compute a hash of its bytes, and compare that hash with the next file's to detect duplicates, similar to the logic below or to Python's filecmp module. The problem I found with this approach is that if two PDFs are generated from the same source DOCX (Save to PDF) at different times, they are not considered duplicates, even though their content is exactly the same. Why does this happen? Is there another way to read the content and create a hash value based on the actual content?

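import hashlib

def calculate_hash_val(path, blocks=65536):
    """Return the MD5 hex digest of the file's raw bytes."""
    hasher = hashlib.md5()
    # The with-block closes the file even if reading fails.
    with open(path, 'rb') as file:
        # Read in chunks of `blocks` bytes so large files are not loaded at once.
        data = file.read(blocks)
        while len(data) > 0:
            hasher.update(data)
            data = file.read(blocks)
    return hasher.hexdigest()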

score 1 · Accepted Answer · answered Oct 22 '22 at 05:00


One thing that happens is that metadata, including the time of creation, is saved into the file. It is not visible when you view the PDF, but it changes the file's bytes, and that makes the hash different.

Here is an explanation of how to find and strip out that data with at least one tool. I am sure that there are many others.
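As a rough illustration of the idea (not the tool referred to above), here is a minimal sketch that blanks the most common time-dependent entries before hashing. It assumes the `/CreationDate`, `/ModDate` and trailer `/ID` entries appear as plain, uncompressed text in the file; PDFs that keep them inside compressed object streams, or that also carry XMP metadata with timestamps, would need a real PDF library. The function name `normalized_pdf_hash` is made up for this example, and SHA-256 is used in place of MD5. The modified bytes are only hashed, never written back, so the invalidated cross-reference offsets do not matter.

```python
import hashlib
import re

# Assumption: these entries are stored as plain text, not inside compressed streams.
# Matches e.g. /CreationDate (D:20221022050000Z), allowing escaped characters.
_DATE_ENTRY = re.compile(rb"/(CreationDate|ModDate)\s*\((?:\\.|[^\\)])*\)")
# Matches the trailer file identifier, e.g. /ID [<AB12> <CD34>].
_ID_ENTRY = re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]")


def normalized_pdf_hash(data: bytes) -> str:
    """Hash PDF bytes after blanking timestamp and file-identifier entries."""
    data = _DATE_ENTRY.sub(rb"/\1()", data)  # keep the key, drop the date value
    data = _ID_ENTRY.sub(b"/ID[]", data)     # drop both identifier strings
    return hashlib.sha256(data).hexdigest()
```

With this, two exports that differ only in those entries hash to the same value, while any other byte difference still produces a different hash.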


btilly


@KJ Unless there is a way to strip and normalize the document, you are right. But if the PDF is produced in a deterministic way from the same Word document, then there is a really good chance that this works. – btilly Oct 22 '22 at 18:40
@KJ In other words you're agreeing with me? – btilly Oct 22 '22 at 23:07
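
For the other half of the question, hashing the actual content rather than the file bytes, one possible sketch is below. It assumes the page text has already been extracted by some PDF library (for example, pypdf's `PdfReader(path).pages` with `page.extract_text()`) and passed in as a list of strings. The name `content_hash` is hypothetical. Note that this ignores images, fonts and layout, so two PDFs with identical text but different appearance would be treated as duplicates.

```python
import hashlib
import re
import unicodedata


def content_hash(page_texts):
    """Hash extracted page text after normalizing Unicode form and whitespace."""
    hasher = hashlib.sha256()
    for text in page_texts:
        # Fold compatibility characters (ligatures, full-width forms) together.
        text = unicodedata.normalize("NFKC", text)
        # Collapse runs of whitespace, which often vary between extractions.
        text = re.sub(r"\s+", " ", text).strip()
        hasher.update(text.encode("utf-8"))
        hasher.update(b"\x0c")  # page separator, so page boundaries affect the hash
    return hasher.hexdigest()
```

Called as `content_hash([page.extract_text() for page in reader.pages])`, this gives the same digest for two exports whose text matches, regardless of metadata.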
