I have a large number of files, some of which are duplicates with different names. I would like to add an fslint-like capability to the file system so that the collection can be de-duplicated once, and any new files created in specified locations are then checked against the known md5 values. The intent is that after the initial summing of the entire collection the ongoing overhead stays small, since only the new file's md5 sum needs to be compared against the store of existing sums. The check could run as a daily job or as part of a file submission process, something like:
checkandsave -f newfile -d destination
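To make the intent concrete, here is a rough sketch of what I imagine checkandsave doing. The command name, the store location ~/.md5store, and its line format are all hypothetical, and a flat file would only be a stopgap for the real store:

```python
#!/usr/bin/env python3
"""Sketch of a 'checkandsave' flow: refuse duplicates, otherwise copy and record the sum."""
import argparse, hashlib, pathlib, shutil, sys

STORE = pathlib.Path.home() / ".md5store"   # hypothetical flat file of "<md5>  <path>" lines

def md5sum(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def known_sums():
    if not STORE.exists():
        return {}
    return dict(line.split(None, 1) for line in STORE.read_text().splitlines() if line.strip())

def main():
    p = argparse.ArgumentParser()
    p.add_argument("-f", "--file", required=True)
    p.add_argument("-d", "--destination", required=True)
    args = p.parse_args()

    digest = md5sum(args.file)
    sums = known_sums()
    if digest in sums:
        print(f"duplicate of {sums[digest]}", file=sys.stderr)
        sys.exit(1)

    # Not a known sum: save the file and append its checksum to the store.
    dest = pathlib.Path(args.destination) / pathlib.Path(args.file).name
    shutil.copy2(args.file, dest)
    with STORE.open("a") as f:
        f.write(f"{digest}  {dest}\n")

if __name__ == "__main__":
    main()
```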
Does such a utility already exist? What is the best way of storing fileid/md5sum pairs so that the search for a new file's sum is as fast as possible?
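My current thinking for the store is something indexed on the checksum (and ideally also on size) so the lookup is effectively constant time regardless of how many files are already known. A sketch using SQLite, where the database name, table and column names are my own invention:

```python
import sqlite3

# Hypothetical schema: one row per known file, indexed on (size, md5)
# so a new file is only compared against files of the same size.
db = sqlite3.connect("checksums.db")
db.execute("""CREATE TABLE IF NOT EXISTS files (
                  path TEXT PRIMARY KEY,
                  size INTEGER NOT NULL,
                  md5  TEXT NOT NULL)""")
db.execute("CREATE INDEX IF NOT EXISTS idx_size_md5 ON files (size, md5)")

def record(path, size, md5):
    """Remember a file's size and checksum."""
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)", (path, size, md5))
    db.commit()

def find_duplicate(size, md5):
    """Return the path of an already-known file with the same size and sum, or None."""
    row = db.execute("SELECT path FROM files WHERE size = ? AND md5 = ?",
                     (size, md5)).fetchone()
    return row[0] if row else None
```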
Re: using rmlint:
Where does rmlint store the checksums, or is that work repeated on every run? I want to attach the checksum to the file's metadata (or to some form of store that is optimised for search speed), so that when I have a new file I can generate its sum and check it against the existing pre-calculated sums for all files of the same size.
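If storing the sum in the file's own metadata is viable, I assume Linux extended attributes (the user.* namespace) would be the place for it. A sketch of what I mean, where the attribute name user.md5sum is my own choice and the filesystem is assumed to support xattrs:

```python
import hashlib, os

ATTR = "user.md5sum"   # hypothetical attribute name; user.* needs no special privileges

def md5sum(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def cached_md5(path):
    """Return the sum cached in the file's extended attributes, or None if absent."""
    try:
        return os.getxattr(path, ATTR).decode()
    except OSError:
        return None

def cache_md5(path):
    """Compute the sum and store it alongside the file (Linux only)."""
    digest = md5sum(path)
    os.setxattr(path, ATTR, digest.encode())
    return digest

def duplicates_of(newfile, candidates):
    """Compare newfile only against candidates of the same size, reusing cached sums."""
    size = os.path.getsize(newfile)
    new_digest = md5sum(newfile)
    same_size = (p for p in candidates if os.path.getsize(p) == size)
    return [p for p in same_size if (cached_md5(p) or cache_md5(p)) == new_digest]
```

The size check first avoids ever hashing files that cannot possibly be duplicates, which is the main saving I am after.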