I working on a pdf file dedupe project and analyzed many libraries in python, which read files, then generate hash value of it and then compare it with the next file for duplication - similar to logic below or using python filecomp lib. But the issue I found these logic is like, if a pdf is generated from a source DOCX(Save to PDF) , those outputs are not considered duplicates - even content is exactly the same. Why this happens? Is there any other logic to read the content, then create a unique hash value based on the actual content.
def calculate_hash_val(path, blocks=65536):
file = open(path, 'rb')
hasher = hashlib.md5()
data = file.read()
while len(data) > 0:
hasher.update(data)
data = file.read()
file.close()
return hasher.hexdigest()