
There are a couple of duplicate file finders for Linux listed e.g. here. I have already tried fdupes and fslint. However, from what I have seen, these find all duplicates across the selected directory structures/search paths, and therefore also duplicates that exist within only one of the search paths (if you select multiple).

What I need, however, is to search for duplicates against a reference path: I want to define one path as the reference and search the other path for files that also exist in the reference path, so that I can remove them.

I need to do this to prepare two large directory structures that have gotten out of sync, where one is more up to date than the other (this would be my reference). Most of the files should be duplicated between the two, but I suspect there are still some files that exist only in the other path, so I don't want to simply delete that whole directory.

Are there perhaps some options to fdupes that I have overlooked which would achieve this?

I have tried writing a Python script to clean up the list that fdupes outputs, but without success.
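Roughly, the approach I was attempting looked like the sketch below (untested; the script name `filter_dupes.py` and the `REFERENCE`/`OTHER` placeholders are just illustrative). It assumes `fdupes -r REFERENCE OTHER` has been run and that its output, in which duplicate groups are separated by blank lines, is piped in:

```python
#!/usr/bin/env python3
# Rough sketch (untested): filter fdupes output against a reference path.
# Reads fdupes output from stdin; duplicate groups are separated by blank lines.
# Prints the files under OTHER that have a content-identical copy under REFERENCE,
# i.e. the candidates that could be removed from OTHER.
# Usage: fdupes -r REFERENCE OTHER | python3 filter_dupes.py REFERENCE OTHER
import os
import sys

ref_root = os.path.abspath(sys.argv[1]) + os.sep
other_root = os.path.abspath(sys.argv[2]) + os.sep

def report(group):
    """Print group members under OTHER if the group also has a member under REFERENCE."""
    in_ref = [f for f in group if os.path.abspath(f).startswith(ref_root)]
    in_other = [f for f in group if os.path.abspath(f).startswith(other_root)]
    if in_ref and in_other:
        for f in in_other:
            print(f)  # candidate for removal

group = []
for line in sys.stdin:
    line = line.rstrip("\n")
    if line:
        group.append(line)
    else:              # a blank line ends one duplicate group
        report(group)
        group = []
report(group)          # handle a final group without a trailing blank line
```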

packoman
  • The duplicates and the "original" files all have the same name and so on? Only last write time / creation time differs? Or do you compare the files with meta information? – Lenniey Jul 21 '17 at 07:21
  • @Lenniey Almost all files should have the same filenames, but I cannot be 100% sure. I would prefer a true binary comparison between files (if you are asking how to check for the duplicates). If two files have the same name and one of them is newer, I would prefer not to delete the newer file (although technically they would not be duplicates then). I am not sure what you mean by meta information(?). I should mention that I want to do this in preparation for merging the two directory structures. – packoman Jul 21 '17 at 07:27
  • I meant [metadata](https://en.wikipedia.org/wiki/Metadata). To only sync files / folders by date and filename, `rsync` is quite enough. For binary / metadata comparison it won't suit you, though. You would need `cmp` or `diff` (or something similar). – Lenniey Jul 21 '17 at 07:29

1 Answer


rmlint can do this:

rmlint --types=duplicates --must-match-tagged --keep-all-tagged <path1> // <path2>

This will find files in path1 that have duplicates (same data content) in path2. It will create a shell script which, if run, removes the duplicates under path1, leaving only the files unique to path1.
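By default rmlint writes this script as `rmlint.sh` in the current directory, so you can review exactly which files under path1 will be removed before executing it.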

thomas_d_j