We consolidated a couple of drives and NAS systems of a former colleague before the disks failed entirely (some were already showing signs of degradation). We know the colleague did "RAID by hand", aka copying stuff back and forth between the disks... and we now have a 16TB data set as a result, with each disk dumped into its own directory on a RAID5-backed NAS.
So I let fdupes run on the data, and it reported a whopping 9TB of duplicates across maybe 1M files in total. Problem is: it did not output a list, and a lot of the file-level duplicates are legitimate anyway (e.g. font assets copied over and over between projects), so a flat list of duplicate files would not help much either. Is there any command-line tool (that part is important, since for performance reasons I have to run it directly on the NAS via ssh) that can identify entire directory trees which are duplicates of each other?
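For reference, the fdupes pass was essentially just a recursive summary run, something like the command below (the path is a placeholder), which is why there is no per-file list to work from:

    # Roughly what was run: -r recurses into the dump, -m only prints a summary
    # of how much space the duplicates waste instead of listing them.
    fdupes -r -m /volume1/dump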
To make things worse: some of the data came from HFS+ Apple disks and some from an old Linux-based NAS that was accessed from Macs via SMB/CIFS. The filename encoding looks fine, but the NAS-sourced dump is littered with .AppleDouble files and similar cruft. So the tool should be able to ignore all the Apple-related metadata (Spotlight indexes, resource forks, thumbnails).
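To make the ask more concrete, here is a rough sketch of what I imagine such a tool doing; I would rather not hand-roll and babysit something like this across 16TB myself. It assumes GNU find/awk/sort, xargs and md5sum are available on the NAS, /volume1/dump is a placeholder path, file names contain no newlines, and the list of Apple metadata names is almost certainly incomplete. The "fingerprint" here is just each directory's sorted file hashes concatenated, which is crude but shows the idea:

    cd /volume1/dump || exit 1

    # 1. Hash every payload file, pruning the Apple metadata up front.
    find . \( -name .AppleDouble -o -name .Spotlight-V100 -o -name .Trashes -o -name .fseventsd \) -prune \
         -o -type f ! -name '._*' ! -name '.DS_Store' -print0 \
      | xargs -0 md5sum > /tmp/file-hashes.txt

    # 2. Attribute every file hash to each of its ancestor directories, then collapse
    #    each directory into a fingerprint made of its sorted file hashes.
    awk '{ hash = $1; path = substr($0, 35)   # md5sum output: 32-char hash, two spaces, path
           while (sub(/\/[^\/]+$/, "", path) && path != ".") print path "\t" hash }' /tmp/file-hashes.txt \
      | LC_ALL=C sort \
      | awk -F"\t" '$1 != dir { if (dir != "") print fp "\t" dir; dir = $1; fp = "" }
                    { fp = fp $2 }
                    END { if (dir != "") print fp "\t" dir }' \
      | LC_ALL=C sort > /tmp/dir-fingerprints.txt

    # 3. Directories that share a fingerprint contain identical file contents; print
    #    them as blank-line-separated groups. (A parent that holds nothing but one
    #    duplicated subtree will show up next to it, which is fine for manual review.)
    awk -F"\t" '$1 == prev { if (!ingroup) { print ""; print prevdir }; print $2; ingroup = 1; next }
                { ingroup = 0; prev = $1; prevdir = $2 }' /tmp/dir-fingerprints.txt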