
Given a long text file like this one (that we will call file.txt):

EDITED

1 AA
2 ab
3 azd
4 ab
5 AA
6 aslmdkfj
7 AA

How can I delete the lines that appear at least twice in this file, using bash? What I mean is that I want to get this result:

1 AA
2 ab
3 azd
6 aslmdkfj

I do not want any duplicate lines in the file. Could you show me the command, please?


3 Answers


Assuming whitespace is significant, the typical solution is:

awk '!x[$0]++' file.txt

(e.g., the line "ab " is not considered the same as "ab". It is probably simplest to pre-process the data if you want to treat whitespace differently.)
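For instance, a minimal sketch of such pre-processing, assuming trailing whitespace is the difference you want to ignore (note that it prints the trimmed lines, not the originals):

awk '{ sub(/[[:space:]]+$/, "") } !x[$0]++' file.txt

Here sub() strips trailing whitespace from $0 before $0 is used as the array key, so "ab " and "ab" count as the same line.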

--EDIT-- Given the modified question, which I'll interpret as only wanting to check uniqueness after a given column, try something like:

awk '!x[ substr( $0, 2 )]++' file.txt

This compares only the substring from the 2nd character to the end of each line, which in this sample skips the one-character first column. Both commands use a typical awk idiom: we build an array named x (one-letter variable names are a terrible idea in a script, but are reasonable for a one-liner on the command line) that counts how many times a given string has been seen, and a line is printed only the first time its key is seen. In the first command the key is the entire input line, $0; in the second it is the substring consisting of everything from the 2nd character onward.
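Applied to the sample file.txt from the question, the keys become " AA", " ab", " azd", and so on, and the command prints exactly the requested result:

awk '!x[ substr( $0, 2 )]++' file.txt
1 AA
2 ab
3 azd
6 aslmdkfj

Note this only works while the first column is a single character wide; lines numbered 10 and up would need the field-based approach discussed in the comments below.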

William Pursell
  • This is the first solution that gives the output the OP actually posted (duplicates removed but original order preserved). – DSM Aug 27 '12 at 20:30
  • Hello, I have another question, and I didn't want to open a new question for it, so I edited mine. If you have some content in front of the data (like times or line numbers) and you want to keep it while filtering out the duplicates, how can we do that, please? – user1619114 Aug 27 '12 at 21:09
  • Try: `awk '{k=$0; $1=""; if( !x[$0]++ ) print k}' file.txt` – William Pursell Aug 27 '12 at 21:24
  • If I have data like this in front of each line, for example "2012-07-25 21:59:28 1", with different dates and times of course, could you explain to me how to get this result, please? – user1619114 Aug 27 '12 at 21:35
  • @user1619114 Perhaps it would help you to read an awk primer. – William Pursell Aug 27 '12 at 21:51
  • Should I use something like `| cut -d" " -f4`, for example? I don't really know how to **not** take the first 3 fields into account when comparing, if you see what I mean, while still keeping them in the final result. – user1619114 Aug 27 '12 at 21:56
  • If you want to ignore the first 3 fields when comparing, you could do as above (in my comment) with `$1=""; $2=""; $3="";`, but I would suggest that this solution is becoming too much of a kludge (see the sketch just after these comments). – William Pursell Aug 27 '12 at 22:03
  • Perfect, I'll try this tomorrow morning! Thank you very much! I'll accept your answer for now; in case of a problem, I will post here again, if you don't mind. Once again, thank you very much for everything! – user1619114 Aug 27 '12 at 22:07
  • Just keep in mind that for really long files this might take a lot of system resources and run very slowly, because of the very large array. – Eran Ben-Natan Aug 28 '12 at 06:21
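A sketch of the field-based variant from the comments above, assuming the first three whitespace-separated fields (e.g. a date, a time, and a counter) should be ignored when comparing but kept in the output:

awk '{ k = $0; $1 = $2 = $3 = ""; if (!x[$0]++) print k }' file.txt

k saves the untouched line; blanking $1 through $3 makes awk rebuild $0 without those fields, so the array key ignores them, and the saved copy k is printed only for the first occurrence of each key.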

Try this simple script:

cat file.txt | sort | uniq

cat will output the contents of the file,

sort will put duplicate entries adjacent to each other

uniq will remove adjacent duplicate entries.
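Note that this pipeline does not preserve the original line order, unlike the awk answer above. A quick sketch with the original, unnumbered sample data (AA, ab, azd, ab, AA, aslmdkfj, AA):

sort file.txt | uniq
AA
ab
aslmdkfj
azd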

Hope this helps!

cjhveal
  • And if I have the time in front of each line I would like to de-duplicate, for example, separated only by a space, how can I do that, please? – user1619114 Aug 27 '12 at 20:30
  • Hello, I have another question, and I didn't want to open a new question for it, so I edited mine. If you have some content in front of the data (like times or line numbers) and you want to keep it while filtering out the duplicates, how can we do that, please? – user1619114 Aug 27 '12 at 21:13
  • is there a way to output _only_ the matches? – galois Aug 24 '15 at 06:07
  • `uniq -d` will print only those lines which are duplicated (assuming as usual that you give it a sorted list). – cjhveal Aug 24 '15 at 10:23
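For example, assuming the unnumbered sample data again, the following prints only the lines that occur more than once:

sort file.txt | uniq -d
AA
ab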

The uniq command will do what you want.

But make sure the file is sorted first; uniq only checks consecutive lines.

Like this:

sort file.txt | uniq
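For reference, sort can also drop the duplicates itself, so the pipeline collapses to a single command:

sort -u file.txt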
Ariel
  • Then how do I sort the file first, please? – user1619114 Aug 27 '12 at 20:27
  • Hello, I have another question, and I didn't want to open a new question for it, so I edited mine. If you have some content in front of the data (like times or line numbers) and you want to keep it while filtering out the duplicates, how can we do that, please? – user1619114 Aug 27 '12 at 21:14