How can I remove lines that appear only once in a file in bash?
For example, file foo.txt has:
1
2
3
3
4
5
After processing the file, only
3
3
will remain.
Note the file is sorted already.
If your duplicated lines are consecutive, you can use uniq:
uniq -D file
From the man page:
-D print all duplicate lines
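As a quick check of the uniq -D approach, the sketch below recreates the question's sample data in a temporary file (the `sample` variable name is only illustrative). It assumes GNU coreutils, since -D is not part of POSIX uniq. If the input were not already grouped, sorting it first should make the duplicates adjacent.

```shell
# Recreate the sample data from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' 1 2 3 3 4 5 > "$sample"

# GNU uniq: print every line that belongs to a run of adjacent duplicates.
uniq -D "$sample"

# For input that is not grouped, sort first so duplicates become adjacent.
printf '%s\n' 3 1 5 3 2 4 | sort | uniq -D
```

Both commands should print the two 3 lines and nothing else.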
Just loop the file twice:
$ awk 'FNR==NR {seen[$0]++; next} seen[$0]>1' file file
3
3
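To make the two passes easier to follow, here is the same command spread over several lines with comments. It is only a sketch: the function name `repeated_two_pass` and the `sample` temporary file are illustrative, not part of the answer.

```shell
# Hypothetical helper: the file name is given to awk twice, so it is read twice.
repeated_two_pass() {
  awk '
    FNR == NR { seen[$0]++; next }   # first pass: count each line, print nothing
    seen[$0] > 1                     # second pass: print lines counted more than once
  ' "$1" "$1"
}

# Sample data from the question, in a temporary file.
sample=$(mktemp)
printf '%s\n' 1 2 3 3 4 5 > "$sample"
repeated_two_pass "$sample"
```

Because the printing happens while the file is read a second time, the surviving lines come out in their original input order, even when the duplicates are not adjacent.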
seen[ record ] keeps track, in an array, of how many times each record has occurred.
Using single-pass awk:
awk '{freq[$0]++} END{for(i in freq) for (j=1; freq[i]>1 && j<=freq[i]; j++) print i}' file
3
3
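The one-liner above can be unpacked into an equivalent, commented form. This is a sketch under the assumption that the order of the output groups does not matter: awk does not specify the iteration order of `for (line in freq)`, so with several duplicated values the groups may come out in any order. The function name `repeated_single_pass` is illustrative.

```shell
# Hypothetical helper: reads the named files, or standard input if none are given.
repeated_single_pass() {
  awk '
    { freq[$0]++ }                       # count every line while reading
    END {
      for (line in freq)                 # iteration order is unspecified in awk
        if (freq[line] > 1)              # skip lines seen only once
          for (j = 1; j <= freq[line]; j++)
            print line                   # emit the line once per occurrence
    }
  ' "$@"
}

# Sample data from the question, fed on standard input.
printf '%s\n' 1 2 3 3 4 5 | repeated_single_pass
```

Since nothing is printed until the END block, the whole set of distinct lines is held in memory, which may matter for very large files.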
With freq[$0]++ we count and store the frequency of each line. In the END block, if a line's frequency is greater than 1, we print that line as many times as its frequency.
Using awk, single pass:
$ awk 'a[$0]++ && a[$0]==2 {print} a[$0]>1' foo.txt
3
3
If the file is unordered, the output will happen in the order duplicates are found in the file due to the solution not buffering values.
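A small experiment with ungrouped input illustrates that ordering. The sketch wraps the same awk program in a function (the name `stream_repeated` is illustrative) and feeds it five lines in which b and a alternate.

```shell
# Hypothetical wrapper around the single-pass command from this answer.
stream_repeated() {
  awk 'a[$0]++ && a[$0]==2 {print} a[$0]>1' "$@"
}

# Ungrouped input: b, a, b, a, b
printf '%s\n' b a b a b | stream_repeated
```

This prints b, b, a, a, b, one per line: each value's first two copies appear together at the moment its second occurrence is read, and later occurrences are printed as they arrive.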
Here's a POSIX-compliant awk alternative to the GNU-specific uniq -D:
awk '++seen[$0] == 2; seen[$0] >= 2' file
This turned out to be just a shorter reformulation of James Brown's helpful answer.
Unlike uniq, this command doesn't strictly require the duplicates to be grouped, but the output order will only be predictable if they are.
That is, if the duplicates aren't grouped, the output order is determined by the relative ordering of the 2nd instances in each set of duplicates, and in each set the 1st and the 2nd instances will be printed together.
For unsorted (ungrouped) data (and if preserving the input order is also important), consider: