22

I need to find every duplicate filename in a given directory tree. I don't know what directory tree the user will give as a script argument, so I don't know the directory hierarchy in advance. I tried this:

#!/bin/sh
find -type f | while IFS= read vo
do
echo `basename "$vo"`
done

but that's not really what I want. It finds only one duplicate and then ends, even if there are more duplicate filenames. It also prints only the filename, not the whole path, and no duplicate count. I wanted to do something similar to this command:

find DIRNAME | tr '[A-Z]' '[a-z]' | sort | uniq -c | grep -v " 1 " 

but it doesn't work for me; I don't know why. Even if I have duplicates, it prints nothing. I use Xubuntu 12.04.

yak

8 Answers

24

Here is another solution (based on the suggestion by @jim-mcnamara) without awk:

Solution 1

#!/bin/sh
dirname=/path/to/directory
find "$dirname" -type f | sed 's_.*/__' | sort | uniq -d |
while read fileName
do
    find "$dirname" -type f | grep "$fileName"
done

However, you have to run the same search twice. This can become very slow if you have to search a lot of data, so saving the find results in a temporary file may give better performance.

Solution 2 (with temporary file)

#!/bin/sh
dirname=/path/to/directory
tempfile=myTempfileName
find "$dirname" -type f > "$tempfile"
sed 's_.*/__' "$tempfile" | sort | uniq -d |
while read fileName
do
    grep "/$fileName" "$tempfile"
done
#rm -f "$tempfile"

Since you might not want to write a temp file to the hard drive in some cases, you can choose the method that fits your needs. Both examples print the full path of each file.

Bonus question here: Is it possible to save the whole output of the find command as a list to a variable?
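
A minimal sketch of one answer to that bonus question (not from the original thread): command substitution can capture the whole find output in a single variable, as a later answer here also demonstrates. The caveat is that a newline-separated list cannot represent filenames that themselves contain newlines.

#!/bin/sh
# Capture the entire find output (newline-separated) in one variable.
list=$(find /path/to/directory -type f)

# Reuse it as often as needed without re-running find:
echo "$list" | sed 's_.*/__' | sort | uniq -d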

psibar
  • You can use grep -f to get rid of the while and simplify it a bit: cat $tempfile | sed 's_.*/__' | sort | uniq -d| grep -f $tempfile – A. Wilson Oct 22 '13 at 20:20
  • 1
    Minor error in Solution 1 may lead to false positives. You'd better write the last find as: find $dirname -type f | grep "^${fileName}$" – prinzdezibel Mar 18 '15 at 13:09
  • How could I change solution 2 so that the first file found is not added to the temporary file, only the duplicates that are found second? – user3746428 Nov 17 '15 at 11:55
  • MacOs : find: -printf: unknown primary or operator – Charaf Oct 18 '18 at 01:50
23

Yes, this is a really old question, but all those loops and temporary files seem a bit cumbersome.

Here's my 1-line answer:

find /PATH/TO/FILES -type f -printf '%p/ %f\n' | sort -k2 | uniq -f1 --all-repeated=separate

It has its limitations due to uniq and sort:

  • no whitespace (space, tab) in filenames (it would be interpreted as a new field by uniq and sort)
  • the file name must be printed as the last field, delimited by a space (uniq doesn't support comparing only one field and is inflexible with field delimiters)

But it is quite flexible regarding its output thanks to find -printf and works well for me. Also seems to be what @yak tried to achieve originally.

Demonstrating some of the options you have with this:

find  /PATH/TO/FILES -type f -printf 'size: %s bytes, modified at: %t, path: %h/, file name: %f\n' | sort -k15 | uniq -f14 --all-repeated=prepend

There are also options in sort and uniq to ignore case (as the topic opener intended by piping through tr). Look them up using man uniq or man sort.
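
For instance, a case-insensitive variant of the one-liner above might look like this (a sketch assuming GNU sort/uniq; -f folds case in sort, -i ignores case in uniq):

find /PATH/TO/FILES -type f -printf '%p/ %f\n' | sort -f -k2 | uniq -i -f1 --all-repeated=separate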

trs
  • `/usr/share/fslint/fslint/findsn /path/to/files` But I like your one-liner better for its flexibility. – Linulin Mar 24 '18 at 00:41
8

#!/bin/sh
dirname=/path/to/check
find "$dirname" -type f |
while read vo
do
  basename "$vo"
done | awk '{arr[$0]++} END{for (i in arr){if(arr[i]>1){print i}}}'
jim mcnamara
  • Is it possible to make it without `awk`? Thanks anyway :) – yak Apr 29 '13 at 11:31
  • You can do it with any language that supports associative arrays (or hashing is another name) - perl is an example. bash 4 has support for associative arrays as well. – jim mcnamara Apr 29 '13 at 11:33
  • So you say that an only-bash solution isn't possible? I mean, without sed, awk, perl, python, etc., just pure bash? – yak Apr 29 '13 at 11:36
  • 2
    by the way, this solution only tells you the filename, without the path where they are. I thought that was a requirement – Elisiano Petrini Apr 29 '13 at 11:45
  • @ElisianoPetrini: oops, thanks, you're right. I need a full path. Question is open again. – yak Apr 29 '13 at 11:49
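
Following up on the associative-array suggestion in the comments above, here is a minimal pure-bash sketch (not part of the original answer; it assumes bash >= 4 for declare -A):

#!/bin/bash
# Group full paths by basename using a bash 4 associative array;
# no sed/awk/perl needed.
declare -A count paths
while IFS= read -r p; do
    name=${p##*/}                               # pure-bash basename
    count[$name]=$(( ${count[$name]:-0} + 1 ))
    paths[$name]+="$p"$'\n'
done < <(find "${1:-.}" -type f)

for name in "${!count[@]}"; do
    if (( count[$name] > 1 )); then
        echo "Duplicates found (${count[$name]}) for $name:"
        printf '%s' "${paths[$name]}"
    fi
done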
2
#!/bin/bash

file=$(mktemp /tmp/duplicates.XXXXX) || { echo "Error creating tmp file"; exit 1; }
find "$1" -type f | sort > "$file"
awk -F/ '{print tolower($NF)}' "$file" |
        sort |
        uniq -c |
        awk '$1>1 { sub(/^[[:space:]]+[[:digit:]]+[[:space:]]+/,""); print }' |
        while read line
        do grep -i "$line" "$file"
        done

rm "$file"

It also works with spaces in filenames. Here's a simple test (the first argument is the directory):

./duplicates.sh ./test
./test/2/INC 255286
./test/INC 255286
2

One "find" command only:

lst=$( find . -type f )
echo "$lst" | rev | cut -f 1 -d/ | rev | sort -f | uniq -i | while read f; do
   names=$( echo "$lst" | grep -i -- "/$f$" )
   n=$( echo "$names" | wc -l )
   [ $n -gt 1 ] && echo -e "Duplicates found ($n):\n$names"
done
Fabien Bouleau
0

This solution writes one temporary file to a temporary directory for every unique filename found. In each temporary file I record the path where I first saw that filename, so that I can print it later. This creates a lot more files than the other posted solutions, but it was something I could understand.

Following is the script, named fndupe.

#!/bin/bash

# Create a temp directory to contain placeholder files.
tmp_dir=$(mktemp -d)

# Get paths of files to test from standard input.
while IFS= read -r p; do
  fname=$(basename "$p")
  tmp_path=$tmp_dir/$fname
  if [[ -e $tmp_path ]]; then
    q=$(cat "$tmp_path")
    echo "duplicate: $p"
    echo "    first: $q"
  else
    echo "$p" > "$tmp_path"
  fi
done

exit

Following is an example of using the script.

$ find . -name '*.tif' | fndupe

Following is example output when the script finds duplicate filenames.

duplicate: a/b/extra/gobble.tif
    first: a/b/gobble.tif

Tested with Bash version: GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)

Mike Finch
0

Here is my contribution (it only searches for a specific file type, PDFs in this case), but it does so recursively:

#!/usr/bin/env bash

find . -type f | while IFS= read -r filename; do
    filename=$(basename -- "$filename")
    extension="${filename##*.}"
    if [[ $extension == "pdf" ]]; then
        fileNameCount=$(find . -iname "$filename" | wc -l)
        if [[ $fileNameCount -gt 1 ]]; then
            echo "File Name: $filename, count: $fileNameCount"
        fi
    fi
done
0

I just stumbled upon this interesting case lately. Sharing my solution here even though the question is long outdated.

Using join; no grep, awk, python, sed, perl, etc.:

#!/bin/sh
list=$(mktemp)
find PATH/TO/DIR/ -type f -printf '%f\t%p\n' | sort -f >$list
cut -d\^I -f1 <$list | uniq -d -i | join -i -t\^I - $list
rm $list

Quick notes:

  • ^I in the command above stands for the tab character. Replace it when typing the actual command (a copy-pasteable variant follows these notes).
  • Spaces in file names are supported.
  • File names must not contain tab or newline characters.
  • Performance seems very good: tested on a large directory tree with several thousand files, the results were almost instant.
  • The comparisons are done case-insensitively. Case sensitivity can be achieved by removing sort's -f and uniq's/join's -i options.
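
A copy-pasteable sketch of the same pipeline, storing the tab in a variable so no literal tab has to be typed (same assumptions as the original, notably GNU find's -printf):

#!/bin/sh
tab=$(printf '\t')
list=$(mktemp)
find PATH/TO/DIR/ -type f -printf "%f${tab}%p\n" | sort -f >"$list"
cut -d"$tab" -f1 <"$list" | uniq -d -i | join -i -t"$tab" - "$list"
rm "$list"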

Example:

Directory tree:

a/f1
a/f2
a/f3
b/f2
c/f2
c/f3

Output:

f2  a/f2
f2  b/f2
f2  c/f2
f3  a/f3
f3  c/f3