
My users tend to save tons of duplicate files, which consume more and more space and generate hardware and archiving costs.

I'm thinking of creating a scheduled job to:

  1. find duplicate files (by comparing MD5 checksums, not only filename / size)
  2. keep only one original file
  3. replace the other redundant copies with a link (shortcut) to the original file (point 2 above)

Any idea how to achieve that?

Script / tool / tips?
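For reference, step 1 (finding duplicates by content rather than by name) can be sketched in Python with the standard library alone. This is a minimal illustration, not a finished tool; the function name find_duplicates and the directory layout are my own:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root_dir):
    """Group files under root_dir by MD5 checksum; any group with more
    than one entry is a set of duplicates, regardless of filename."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            md5 = hashlib.md5()
            # hash in chunks so large files don't have to fit in memory
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(65536), b''):
                    md5.update(chunk)
            by_hash[md5.hexdigest()].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Each returned group would then feed steps 2 and 3: keep one path, replace the rest with links.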

EDIT 28.10.2021

In the meantime I've found finddupe: https://www.sentex.ca/~mwandel/finddupe/

It allows creating hardlinks to the original files. I've tried it - it correctly shows what is duplicated and seems to create the hardlinks - but... I can't see any difference in the HDD usage stats afterwards.

Why is that? Could it be that Windows calculates the free space incorrectly?
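One way to check whether the hardlinks were actually created is to inspect the link count and file identity via os.stat. A small sketch (the temp-file names are mine; on NTFS with a recent Python, st_nlink and st_ino are populated too):

```python
import os
import tempfile

# Create a file and a hardlink to it, then inspect both names.
# Both names refer to the same data on disk, so tools that sum
# sizes per file name count the content twice even though only
# one copy is stored - which can make usage stats look unchanged.
tmp = tempfile.mkdtemp()
original = os.path.join(tmp, 'original.txt')
link = os.path.join(tmp, 'copy.txt')

with open(original, 'w') as f:
    f.write('same content')

os.link(original, link)  # hardlink, not a copy

st_a = os.stat(original)
st_b = os.stat(link)
print(st_a.st_nlink)                # 2: two directory entries, one file
print(st_a.st_ino == st_b.st_ino)   # True: same underlying file
```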

Maciej

2 Answers


I made a small script in Python that answers your needs.

It uses fdupes -r <dir> to get all duplicate files (even ones with different names). It then iterates over the output, deletes each duplicated file, and creates a symbolic link in its place.

Uncomment the two os.system() lines to actually apply the modifications.

You may want to pass parameters to this script (like a path or similar); I leave that part to you :)

import os
import shlex

root_dir = '/home/user/directory'

# fdupes prints groups of identical files separated by blank lines
blocks_of_dup_files = os.popen('fdupes -r ' + shlex.quote(root_dir)).read().split('\n\n')

# drop the trailing empty block produced by the final newline
if blocks_of_dup_files[-1] == '':
    blocks_of_dup_files.pop()

for files in blocks_of_dup_files:
    files = files.split('\n')
    kept_file = files.pop()  # keep the last file of each group as the original
    for file in files:
        print('rm -f ' + shlex.quote(file))
        print('ln -s ' + shlex.quote(kept_file) + ' ' + shlex.quote(file))

        # uncomment to actually delete the duplicate and link it to the original
        #os.system('rm -f ' + shlex.quote(file))
        #os.system('ln -s ' + shlex.quote(kept_file) + ' ' + shlex.quote(file))

Martin
  • Thanks. It seems your solution is intended for Linux. I need something like that for Windows (sorry, I forgot to mention that in my post - corrected) – Maciej Oct 27 '21 at 12:23
  • OK, I've found this can be installed on Windows via Choco. Will give it a try – Maciej Oct 28 '21 at 10:05

For Windows I authored https://github.com/Caspeco/BlobBackup/tree/master/DuplicateFinder

You will need Visual Studio to compile the code. Note, though, that with links, if one "file" is modified then all of them are (or rather, there is only one file). That could be unwanted behaviour for users.
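That caveat is easy to demonstrate with hardlinks in a few lines of Python (a standalone sketch, not tied to DuplicateFinder; the file names are mine):

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
a = os.path.join(tmp, 'a.txt')
b = os.path.join(tmp, 'b.txt')

with open(a, 'w') as f:
    f.write('original')
os.link(a, b)  # b is now a hardlink to the same file as a

# Writing through one name changes what every name sees,
# because there is only one underlying file.
with open(b, 'w') as f:
    f.write('modified')

with open(a) as f:
    print(f.read())  # prints: modified
```

So a user who edits "their" copy would silently change everyone else's as well - fine for read-only archives, surprising otherwise.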

NiKiZe
  • Thanks for sharing this. I've compiled it, but I can't find any info on command-line parameters, how it works, etc. I did a quick check adding one parameter (directory to scan); it returned: Duplicate done 4 items traversed in xxx - but no info on whether duplicates were found (there are some), and also no info about (hard) linking – Maciej Oct 28 '21 at 10:04
  • It matches on size and checksums (only testing whether files are duplicates). If files are already linked they are "skipped"; if duplicates are found it will print them. – NiKiZe Oct 28 '21 at 11:27