We consolidated a couple of drives and NAS systems of a former colleague before the disks failed entirely (some were already showing signs of degradation). We know the colleague did "RAID by hand", aka copying stuff back and forth between the disks... and we now have a 16TB data set as a result, with each disk dumped into its own directory on a RAID5-backed NAS.
So I let fdupes run on the data, and it reported a whopping 9TB of duplicates across maybe 1M files in total. Problem is: it did not output a list, and a lot of the file-level duplicates are legitimate anyway (e.g. font assets copied over and over between projects), so a flat list of duplicate files would not help much either. Is there any command-line tool (that part is important, since for performance reasons I have to run it directly on the NAS via ssh) that can identify entire directory trees which are duplicates of each other?
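For reference, the fdupes pass was essentially just a recursive summary run, something like the command below (the path is a placeholder), which is why there is no per-file list to work from:

    # Roughly what was run: -r recurses into the dump, -m only prints a summary
    # of how much space the duplicates waste instead of listing them.
    fdupes -r -m /volume1/dump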
To make things worse: some of the data came from HFS+ Apple disks and some from an old Linux-based NAS that was accessed from Macs via SMB/CIFS. The filename encoding looks fine, but the NAS-sourced dump is littered with .AppleDouble files and similar cruft. So the tool should be able to ignore all the Apple-related metadata (Spotlight indexes, resource forks, thumbnails).
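To make the ask more concrete, here is a rough sketch of what I imagine such a tool doing; I would rather not hand-roll and babysit something like this across 16TB myself. It assumes GNU find/awk/sort, xargs and md5sum are available on the NAS, /volume1/dump is a placeholder path, file names contain no newlines, and the list of Apple metadata names is almost certainly incomplete. The "fingerprint" here is just each directory's sorted file hashes concatenated, which is crude but shows the idea:

    cd /volume1/dump || exit 1

    # 1. Hash every payload file, pruning the Apple metadata up front.
    find . \( -name .AppleDouble -o -name .Spotlight-V100 -o -name .Trashes -o -name .fseventsd \) -prune \
         -o -type f ! -name '._*' ! -name '.DS_Store' -print0 \
      | xargs -0 md5sum > /tmp/file-hashes.txt

    # 2. Attribute every file hash to each of its ancestor directories, then collapse
    #    each directory into a fingerprint made of its sorted file hashes.
    awk '{ hash = $1; path = substr($0, 35)   # md5sum output: 32-char hash, two spaces, path
           while (sub(/\/[^\/]+$/, "", path) && path != ".") print path "\t" hash }' /tmp/file-hashes.txt \
      | LC_ALL=C sort \
      | awk -F"\t" '$1 != dir { if (dir != "") print fp "\t" dir; dir = $1; fp = "" }
                    { fp = fp $2 }
                    END { if (dir != "") print fp "\t" dir }' \
      | LC_ALL=C sort > /tmp/dir-fingerprints.txt

    # 3. Directories that share a fingerprint contain identical file contents; print
    #    them as blank-line-separated groups. (A parent that holds nothing but one
    #    duplicated subtree will show up next to it, which is fine for manual review.)
    awk -F"\t" '$1 == prev { if (!ingroup) { print ""; print prevdir }; print $2; ingroup = 1; next }
                { ingroup = 0; prev = $1; prevdir = $2 }' /tmp/dir-fingerprints.txt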