How to remove lines that appear only once in a file using bash

Question

How can I remove lines that appear only once in a file in bash?

For example, file foo.txt has:

After processing the file, only

3
3

will remain.

Note the file is sorted already.

If there are numbers `1,3,1,3` is that the order of output or can you handle `1,1,3,3,`? — James Brown, Oct 21 '16 at 11:24

score 6 · Accepted Answer · answered Oct 21 '16 at 11:27


If your duplicate lines are consecutive, you can use `uniq`:

uniq -D file

from the man pages:

-D print all duplicate lines
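A quick way to check this on a small sample is sketched below. The input values are made up for illustration, and `-D` assumes GNU `uniq`:

```shell
# Hypothetical sorted sample: 1 and 5 occur once, 3 occurs twice.
result=$(printf '%s\n' 1 3 3 5 | uniq -D)

# Only the lines belonging to a group of adjacent duplicates remain.
printf '%s\n' "$result"
```

With this input, only the two `3` lines should be printed.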


oliv



If the duplicate lines aren't consecutive, you must sort them first: `sort file | uniq -D` – Frank Neblung Oct 21 '16 at 11:53

Note that `-D` is a _GNU-specific_ extension and won't work with BSD/macOS `uniq`. – mklement0 Oct 21 '16 at 18:47

score 3 · Answer 2 · answered Oct 21 '16 at 11:21


Just loop through the file twice:

$ awk 'FNR==NR {seen[$0]++; next} seen[$0]>1' file file
3
3

The first pass (FNR==NR) counts how many times each line occurs; the array seen[] is keyed by the record, $0.
The second pass prints the lines that occur more than once.
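As a minimal sketch of the two-pass idea on unsorted input, the following writes a few made-up values to a temporary file (both the values and the file are assumptions for illustration) and runs the same awk program over it twice:

```shell
# Hypothetical unsorted sample: 3 and 1 occur twice, 2 occurs once.
tmp=$(mktemp)
printf '%s\n' 3 1 3 2 1 > "$tmp"

# Pass 1 (FNR==NR) counts each line; pass 2 prints lines counted more than once.
result=$(awk 'FNR==NR {seen[$0]++; next} seen[$0]>1' "$tmp" "$tmp")
rm -f "$tmp"

printf '%s\n' "$result"
```

The surviving lines should appear as `3`, `1`, `3`, `1`, that is, in their original positions with only the lone `2` dropped.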


fedorqui



This answer will preserve the original order and it will work even if input data is unsorted ++ – anubhava Oct 21 '16 at 14:14

score 2 · Answer 3 · answered Oct 21 '16 at 11:21


Using single-pass awk:

awk '{freq[$0]++} END{for(i in freq) for (j=1; freq[i]>1 && j<=freq[i]; j++) print i}' file

3
3

Using freq[$0]++, we count and store the frequency of each line.
In the END block, if a line's frequency is greater than 1, we print that line as many times as it occurred.
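Note that the order in which awk's `for (i in freq)` visits keys is unspecified, so the groups may not come out in input order. A possible variant, which is not part of the original answer, records the order of first appearance in a second array. The names `order`, `n`, and `k` are introduced here for illustration, and the sample input is made up:

```shell
# Hypothetical unsorted sample: 3 and 1 occur twice, 2 occurs once.
result=$(printf '%s\n' 3 1 3 2 1 | awk '
  !($0 in freq) { order[++n] = $0 }   # remember each new line in first-seen order
  { freq[$0]++ }                      # count every occurrence
  END {
    for (k = 1; k <= n; k++)          # walk lines in first-seen order
      if (freq[order[k]] > 1)
        for (j = 1; j <= freq[order[k]]; j++) print order[k]
  }')

printf '%s\n' "$result"
```

Here the duplicates are emitted grouped together (`3`, `3`, `1`, `1`), ordered by where each value first appeared, rather than in their exact original positions.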


anubhava



nice! I like your [`for-loop` guru approaches](http://stackoverflow.com/a/40110515/1983854) – fedorqui Oct 21 '16 at 12:54

James Brown · Answer 4 · 2016-10-21T13:00:29.783

2

Using awk, single pass:

$ awk 'a[$0]++ && a[$0]==2 {print} a[$0]>1' foo.txt
3
3

If the file is unordered, the output appears in the order in which the duplicates are detected, because the solution does not buffer any lines: the first two occurrences of a line are printed together when the second one is read.
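To see that ordering effect, here is a small sketch with made-up unordered input, where `3` appears first but its duplicate is detected last:

```shell
# Hypothetical unordered sample: the second 1 is read before the second 3.
result=$(printf '%s\n' 3 1 1 3 | awk 'a[$0]++ && a[$0]==2 {print} a[$0]>1')

printf '%s\n' "$result"
```

This should print `1`, `1`, `3`, `3`: the pair of `1` lines comes out first because its second occurrence is encountered first.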

edited Oct 21 '16 at 13:00

answered Oct 21 '16 at 12:26

James Brown


score 1 · Answer 5 · edited May 23 '17 at 10:32

Here's a POSIX-compliant awk alternative to the GNU-specific uniq -D:

awk '++seen[$0] == 2; seen[$0] >= 2' file

This turned out to be just a shorter reformulation of James Brown's helpful answer.

Unlike uniq, this command doesn't strictly require the duplicates to be grouped, but the output order will only be predictable if they are.

That is, if the duplicates aren't grouped, the output order is determined by the relative ordering of the 2nd instances in each set of duplicates, and in each set the 1st and the 2nd instances will be printed together. Any further instances are printed as they are encountered.
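A short sketch with made-up ungrouped input illustrates this, including a line that occurs three times:

```shell
# Hypothetical ungrouped sample: a occurs three times, b twice, c once.
result=$(printf '%s\n' a b a c b a | awk '++seen[$0] == 2; seen[$0] >= 2')

printf '%s\n' "$result"
```

The expected output is `a`, `a`, `b`, `b`, `a`: each pair is printed when its 2nd instance is read, and the 3rd `a` follows later, so the members of one set need not be adjacent in the output.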

For unsorted (ungrouped) data (and if preserving the input order is also important), consider:

fedorqui's helpful answer (elegant, but requires reading the file twice)
anubhava's helpful answer (single-pass solution, but a little more cumbersome).

@JamesBrown: Thanks - I didn't actually notice (and up-voted) your answer until after I'd written mine. — mklement0, Oct 21 '16 at 20:32
