#!/bin/sh
LASTBASE=""
find "$1" -type f -print | rev | sort | rev | while read FILE
do
    BASE=$(basename "$FILE")
    if [ "$BASE" = "$LASTBASE" ]; then
        rm "$FILE"
    fi
    LASTBASE="$BASE"
done
stefcud

3 Answers


If you pipe the output of find into a while read loop you can deal with the filenames line by line:

find nnn/ -type f -print | rev | sort | rev | while read FILE; do
    ...
done

Edit: This method does break if filenames contain double (consecutive) spaces, because read splits the line up according to $IFS and then joins the fields with single spaces when storing them in the last variable. To address this you can temporarily change $IFS to disable splitting:

OIFS="$IFS"
IFS=""
find | while read...
IFS="$OIFS"

Edit: test (which is the same as [) doesn't have a == operator, you just want =.
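Putting the pieces together, here is a sketch of the whole script with those fixes applied (wrapped in a function for illustration; the function name is mine, not from the question):

```shell
#!/bin/sh
# dedupe_by_name DIR: keep one file per basename under DIR, remove
# the rest. Clearing IFS for read preserves consecutive spaces, and
# read -r stops backslashes from being interpreted.
dedupe_by_name() {
    LASTBASE=""
    find "$1" -type f -print | rev | sort | rev | while IFS= read -r FILE
    do
        BASE=$(basename "$FILE")
        if [ "$BASE" = "$LASTBASE" ]; then
            rm "$FILE"
        fi
        LASTBASE="$BASE"
    done
}
```

Note that filenames containing newlines would still confuse the line-oriented rev | sort | rev pipeline; for those only a NUL-delimited approach is safe.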

mgorven

I just found this "gem" in an old bash history and it, well, actually works without stumbling over whitespace in filenames.

Content-wise Comparison

for hash in $(find . -exec md5sum {} \; 2>/dev/null | sort | awk '{ print $1 }' | uniq -d); do
    find . -exec md5sum {} \; 2>/dev/null | grep "$hash" | awk '{ print $2 }'
done

informal:

  • First line: traverse the directory tree, calculate the md5sum of every file below, sort this output (format: hash filename), grab the hash column and reduce it to the values that occur more than once (meaning there are duplicates).
  • Second line: for each of the double-occurring hashes, traverse again and print the filename if the current file has the current hash (meaning the file is one of several).

example output:

./aFile
./aFolder/aFile
./1000digitsOfPI
./a/b/c/thousanddigitsofPI
./b File
./bFolder/cFolder/b File

Removing is not implemented here because it might be hard to decide which version of the doubled files you want to keep.
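A related one-liner for the content-wise comparison, assuming GNU tools (xargs -0 and the -w/--all-repeated options of uniq are GNU extensions), hashes each file only once and lets uniq group lines that share the same first 32 characters, i.e. the same MD5 hash:

```shell
# Hash every file once, sort so equal hashes are adjacent, then print
# all lines whose first 32 characters (the MD5 hash) repeat.
find . -type f -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
```

This also avoids re-running find once per duplicate hash, which matters on big trees.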


Filename-wise Comparison

If you just want to look at filenames and not at contents, it gets even easier:

for name in $(find . -type f -printf "%f\n" | sort | uniq -d); do
    find . -name "$name"
done

Update: Unfortunately this version breaks on whitespace in filenames again.
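A whitespace-safe variant of the filename comparison is possible with NUL-delimited records, assuming GNU find/sort/uniq (the -z options are GNU extensions) and bash for read -d '':

```shell
# Emit basenames NUL-terminated, keep the ones that occur more than
# once, then locate every file carrying such a basename. Breaks only
# if a basename contains glob characters like * or ?.
find . -type f -printf '%f\0' | sort -z | uniq -zd |
while IFS= read -r -d '' name; do
    find . -type f -name "$name"
done
```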

Karma Fusebox
  • this code is very interesting, but unfortunately I cannot run md5 because the files are very large and the server's resources are tiny. In my case I know that files with the same name also have the same content; how can I modify your code to check based on names only? – stefcud Feb 07 '13 at 01:17
  • Oh, in that case it's not a wtf-gem anymore, just an ordinary *find*. This textbox doesn't like the long line, I'll edit it into the answer. – Karma Fusebox Feb 07 '13 at 01:32
  • There you go... – Karma Fusebox Feb 07 '13 at 01:40
  • in the title I wrote "duplicate filenames", not "duplicate files"; anyway thanks, your code is very useful all the same – stefcud Feb 07 '13 at 02:05
  • I know, all I wanted was to paste some quirky old code that luckily might work for you, even if it does not match your exact request. ;) I have other bad news though. As I'm playing around with it, I see that the filename-comparison suffers from the whitespaces in names again. ARGH. Sorry, don't think it can be done this way. – Karma Fusebox Feb 07 '13 at 02:20
  • view my last edit, I found a complete solution (also for white spaces in names) by trying the suggestions of @mgorven – stefcud Feb 07 '13 at 02:25

The problem lies in this line of code: for FILE in $FILES; do. The for loop assigns the FILE variable by splitting $FILES on whitespace, so any filename containing one or more spaces won't survive. Simply change the default IFS from whitespace to a newline (or tab). If I remember correctly, you can set IFS in bash using something like this:

IFS=$'\n'
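For example, a sketch of the fix (the FILES variable and the loop body are illustrative; they are not taken from the original question):

```shell
#!/bin/bash
# Illustrative only: with IFS set to a newline, the unquoted $FILES
# expansion splits per line instead of per word, so names containing
# spaces stay intact.
IFS=$'\n'
FILES=$(find . -type f)
for FILE in $FILES; do
    echo "found: $FILE"
done
unset IFS   # restore default word splitting afterwards
```

Filenames containing embedded newlines will still break; only a NUL-delimited pipeline handles those.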

Daniel t.