
I have a large collection of files that occasionally contains duplicates with different names, and I would like to add an fslint-like capability to the file system: de-duplicate the collection once, then check any new files created in specified locations against the known md5 values. The intent is that after the initial summing of the entire collection the ongoing overhead stays small, since only the new file's md5 sum has to be compared against the store of existing sums. This check could run as a daily job or as part of a file submission process.

checkandsave -f newfile -d destination

Does such a utility already exist? What is the best way of storing fileid-md5sum pairs so that the search for a new file's sum is as fast as possible?
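
For illustration, this is the sort of wrapper I have in mind; the sums-file path, option handling and store format below are placeholders, not an existing tool:

#!/bin/sh
# checkandsave (sketch): store a file only if its md5 is not already known
# usage: checkandsave -f newfile -d destination
SUMS=/var/lib/dedup/md5sums.txt   # lines of "md5sum  path", one per known file

while getopts f:d: opt; do
    case $opt in
        f) newfile=$OPTARG ;;
        d) dest=$OPTARG ;;
    esac
done

sum=$(md5sum -- "$newfile" | cut -d' ' -f1)

# flat file plus grep is enough for a sketch; a sorted file with look(1)
# or an indexed database table would scale better as the collection grows
if grep -q "^$sum " "$SUMS"; then
    echo "duplicate: $newfile matches an existing file" >&2
    exit 1
fi

cp -- "$newfile" "$dest"/ &&
    printf '%s  %s\n' "$sum" "$dest/$(basename -- "$newfile")" >> "$SUMS"

For the store itself I am unsure whether a flat file, the file's own metadata, or an indexed database table is the better choice, hence the question.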

Re: using rmlint:

Where does rmlint store the checksums, or is that work repeated on every run? I want to add the checksum to the file metadata (or some form of store that optimises search speed) so that when I have a new file I can generate its sum and check it against the existing pre-calculated sums for all files of the same size.
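
To make that check concrete, this is roughly the comparison I mean (the store format and paths are placeholders): the md5 comparison only happens when a stored file shares the new file's size.

STORE=/var/lib/dedup/size-md5.txt     # lines of "<size> <md5> <path>"
newfile=/incoming/somefile

size=$(stat -c%s -- "$newfile")

# hash the new file only if the store already holds a file of the same size
if grep -q "^$size " "$STORE"; then
    newsum=$(md5sum -- "$newfile" | cut -d' ' -f1)
    grep -q "^$size $newsum " "$STORE" && echo "duplicate of an existing file"
fi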

1 Answer


Yes, rmlint can do this via the --xattr-read and --xattr-write options.

The cron job would be something like:

/usr/bin/rmlint -T df -o sh:/home/foo/dupes.sh -c sh:link --xattr-read --xattr-write /path/to/files

-T df means look only for duplicate files

-o sh:/home/foo/dupes.sh specifies where to put the output report / shell script (if you want one)

-c sh:link specifies that the shell script should replace duplicates with hardlinks or symlinks (or reflinks on btrfs)
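
For reference, and purely as an example (the 03:00 schedule and the paths are assumptions), the corresponding crontab entry could look like this:

# m h  dom mon dow   command
0 3 * * * /usr/bin/rmlint -T df -o sh:/home/foo/dupes.sh -c sh:link --xattr-read --xattr-write /path/to/files

Note that rmlint itself only writes the report; the generated /home/foo/dupes.sh still has to be run (after review, or automatically if you trust it) to actually replace the duplicates.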

Note that rmlint only calculates file checksums when necessary; for example, if there is only one file with a given size then there is no possible duplicate, so no checksum is calculated.

Edit: the checksums are stored in the file's extended attributes (xattr metadata). The default uses SHA1, but you can switch this to md5 via -a md5.
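
If you want to inspect what was written, getfattr can dump the attributes; the exact attribute names rmlint uses are an implementation detail, so the sketch below simply dumps everything in the user namespace, and the paths are examples:

# dump all user.* extended attributes (including rmlint's cached checksum)
getfattr -d -- /path/to/files/somefile

# force md5 instead of the default hash when (re)writing the cache
/usr/bin/rmlint -T df -a md5 --xattr-read --xattr-write /path/to/files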

thomas_d_j