
There are a couple of duplicate file finders for Linux listed e.g. here. I have already tried fdupes and fslint. However, from what I have seen, these find all duplicates across the selected directory structures/search paths, and therefore also duplicates that exist within only one of the search paths (if you select multiple).

What I need, however, is to search for duplicates against a reference path: I want to define one path as the reference and search the other path for files that also exist in the reference path, so that I can remove them.

I need to do this to prepare two large directory structures that have gotten out of sync, where one is more up to date than the other (this would be my reference). Most of the files should be duplicated between the two, but I suspect there are still some files that exist only in the other path, so I don't want to simply delete that whole directory.

Are there perhaps some options to fdupes that I have overlooked which would achieve this?

I have tried writing a Python script to clean up the list that fdupes outputs, but without success.
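Roughly, the approach I was attempting looked like the sketch below (untested; the script name `filter_dupes.py` and the `REFERENCE`/`OTHER` placeholders are just illustrative). It assumes `fdupes -r REFERENCE OTHER` has been run and that its output, in which duplicate groups are separated by blank lines, is piped in:

```python
#!/usr/bin/env python3
# Rough sketch (untested): filter fdupes output against a reference path.
# Reads fdupes output from stdin; duplicate groups are separated by blank lines.
# Prints the files under OTHER that have a content-identical copy under REFERENCE,
# i.e. the candidates that could be removed from OTHER.
# Usage: fdupes -r REFERENCE OTHER | python3 filter_dupes.py REFERENCE OTHER
import os
import sys

ref_root = os.path.abspath(sys.argv[1]) + os.sep
other_root = os.path.abspath(sys.argv[2]) + os.sep

def report(group):
    """Print group members under OTHER if the group also has a member under REFERENCE."""
    in_ref = [f for f in group if os.path.abspath(f).startswith(ref_root)]
    in_other = [f for f in group if os.path.abspath(f).startswith(other_root)]
    if in_ref and in_other:
        for f in in_other:
            print(f)  # candidate for removal

group = []
for line in sys.stdin:
    line = line.rstrip("\n")
    if line:
        group.append(line)
    else:              # a blank line ends one duplicate group
        report(group)
        group = []
report(group)          # handle a final group without a trailing blank line
```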

packoman
  • The duplicates and the "original" files all have the same name and so on? Only last write time / creation time differs? Or do you compare the files with meta information? – Lenniey Jul 21 '17 at 07:21
  • @Lenniey Almost all files should have the same filenames, but I cannot be 100% sure. I would prefer a true binary comparison between files (if you are asking how to check for the duplicates). If two files have the same name and one of them is newer, I would prefer not to delete the newer file (although technically they would not be duplicates then). I am not sure what you mean by meta information(?). I should mention that I want to do this in preparation for merging the two directory structures. – packoman Jul 21 '17 at 07:27
  • I meant [metadata](https://en.wikipedia.org/wiki/Metadata). To only sync files / folders by date and filename, `rsync` is quite enough. For binary / metadata comparison it won't suit you, though. You would need `cmp` or `diff` (or something similar). – Lenniey Jul 21 '17 at 07:29

1 Answer


rmlint can do this:

rmlint --types=duplicates --must-match-tagged --keep-all-tagged <path1> // <path2>

This will find files in path1 that have duplicates (same data content) in path2. It will create a shell script which, if run, removes the duplicates under path1, leaving only the files unique to path1.
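By default rmlint writes this script as `rmlint.sh` in the current directory, so you can review exactly which files under path1 will be removed before executing it.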

thomas_d_j