
I tend to ramble, so I apologise in advance if a bid to cut the chaff leads to less context (or I just fail miserably and ramble nonetheless).

I'm trying to improve some tools I wrote for rsyncing a large amount of data from one network storage location to another for archiving purposes (the second network location is part of a much larger tape library system). Because of a large number of shared assets, the directories to move usually contain many hard-linked files, and I use rsync to preserve those links.
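
For context, the copy itself is along these lines (the paths here are placeholders and the exact flag set varies, but -H / --hard-links is the part that preserves the hard links):

rsync -aH /mnt/source_share/project/ /mnt/archive_staging/project/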

Rsyncing in the region of 1TB of actual data that, once the hard links are counted towards the total, comes to 4 or 5 times that size (i.e. 4-5TB) is not uncommon, or unexpected.

For various reasons, I need to hash the data in the source, compare it to the destination data, AND keep a record of those hash results (including the hashes themselves). That way, if restored data is unexpectedly corrupt, I can compare the hash of the restored file against the hash of the same file when it was originally rsynced, to pinpoint when / if the corruption occurred.

After the rsync has happened, I use the following to md5 the source (any hash would do, but I chose md5 for no specific reason):

find . -type f -exec md5sum "{}" + > "$temp_file"

The contents of $temp_file are appended to my main output file as well. Then I move to the destination and run the command below (it's done that way round, source first then destination, so that if folders are being merged it only hashes the files moved in this latest rsync):

md5sum -c "$temp_file" >> "$output_file"
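
Putting those pieces together, the whole check is roughly this (a sketch only; $source_dir and $dest_dir stand in for however the real paths get set in my tooling):

cd "$source_dir"
find . -type f -exec md5sum "{}" + > "$temp_file"
cat "$temp_file" >> "$output_file"        # keep a record of the source hashes
cd "$dest_dir"
md5sum -c "$temp_file" >> "$output_file"  # verify the copies against that record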

All is well and good, and this does work, EXCEPT that it hashes every file, hard links included, in effect computing the md5 hash of the same data over and over again, which can add hours to the process overall.

Is there a way to edit the 'find...' command to ignore hard-linked files, BUT still hash the 'original' file that the hard links actually point to? I did look into the following:

find . -type f -links 1

But my concern is that ALL hard-link-related files will be ignored, rather than listing the 'original' file that actually occupies the inode and excluding only the files that subsequently point to that inode.
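
A throwaway test of what I mean, with made-up file names:

mkdir -p /tmp/linktest && cd /tmp/linktest
echo data > original.bin    # ordinary file, link count 1
ln original.bin copy.bin    # a second name for the same inode, link count now 2
find . -type f -links 1     # my worry: does this now print neither name?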

Am I right about -links 1 ignoring all hard-link related files, and if so, what can I do?

Owen Morgan
  • You could write a memoization script that builds an inode → hash cache to avoid recomputing duplicates. – John Kugelman Jul 20 '20 at 19:12
  • All regular files are hard links. If an inode has multiple hard links, you can't tell which link was created first. You can memoize the hashes by e.g. having find output inode number and filename, read them in a `while read` loop, and use an associative array to keep track of whether you've already processed this inode. – that other guy Jul 20 '20 at 19:25
  • Thank you both, this info was most helpful, and also exposed some areas I really don't understand (i.e. all regular files are hard links) and I'm grateful for your information!! – Owen Morgan Jul 21 '20 at 14:06

2 Answers


Unlike soft links, hard links are regular files: every link to a file points to the same inode number, and conceptually there is no 'original' versus 'duplicate' hard link.
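
You can see this for yourself with ls -i (the file names here are made up):

ln fileA fileB       # fileB becomes another name for fileA's inode
ls -i fileA fileB    # both entries report the same inode number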

What you can do here is use -samefile with the find command to get all the paths that are hard links to the same file, put them into an ignore list, and use that ignore list to skip the duplicates.

touch /tmp/duplicates
find . -type f | while IFS= read -r f
do
    # skip this path if an earlier iteration already recorded it as a duplicate
    if ! grep -Fxq "$f" /tmp/duplicates
    then
        # record every other path that shares this file's inode
        find . -samefile "$f" | grep -Fxv "$f" >> /tmp/duplicates
        # put md5sum procedure for $f here
    fi
done

initanmol
  • Thank you!! This is clearly a case where my lack of 'in-depth' understanding of how filesystems work is showing. This is a nice solution (that also fits my style of coding using temp files for various stuff - it's not that I don't like arrays, just that their bash syntax is imo clunky and hard to read!). – Owen Morgan Jul 21 '20 at 14:02

As an alternative to comparing each and every file with the list of processed files, consider using the inode (as suggested by commenters). Depending on the number of files in the tree, it might save time by removing the repeated 'find' over the tree.

#! /bin/bash

declare -A seen                     # maps inode number -> first path seen for that inode
find . -type f -printf '%i %p\n' | while read -r inode file ; do
    [ "${seen[$inode]}" ] && continue   # already hashed another link to this inode
    seen[$inode]=$file
    # MD5 calculation ...
    md5sum "$file"
    # ...
done
dash-o
  • Thank you! I accepted anmol's as the answer because his snippet fits my coding style, but I can see how this can be quicker with huge trees, which do occur, so I will test this as well for speed :) Thank you for taking the time to respond! – Owen Morgan Jul 21 '20 at 14:04
  • Sorry to double comment. I would just add that to get this to work I had to modify the `find` to be `find . -type f -exec ls -i "{}" +` so the pipe outputs the inode and filename for the `while read` loop. – Owen Morgan Jul 21 '20 at 17:23
  • @OwenMorgan I've fixed the command line. I missed it when I merged my solution into the original post's code. It should work without the `exec`, which would be a big performance penalty. – dash-o Jul 21 '20 at 18:22