26

I have a file that looks like this:

2011-03-21 name001 line1
2011-03-21 name002 line2
2011-03-21 name003 line3
2011-03-22 name002 line4
2011-03-22 name001 line5

For each name, I only want its last appearance. So I expect the result to be:

2011-03-21 name003 line3
2011-03-22 name002 line4
2011-03-22 name001 line5

Could someone give me a solution with bash/awk/sed?

Dagang

4 Answers

39

This gets the unique lines by the second field, but working from the end of the file, so the last appearance of each name wins (as in your expected result):

tac temp.txt | sort -k2,2 -r -u
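
For reference, on the sample input this reproduces the expected result (assuming GNU coreutils, whose `sort -u` keeps the first line it sees for each key):

    2011-03-21 name003 line3
    2011-03-22 name002 line4
    2011-03-22 name001 line5

`-k2,2` restricts the sort key to the second field, `-u` keeps only one line per key, and reversing the file with `tac` first makes that kept line the last appearance in the original file; `-r` only reverses the output order and happens to match the ordering asked for in the question.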
PaulP
  • Make sure that the last line of your input file ends with a \n, otherwise tac will concatenate it with the last but one line – Rishi Dua Jul 08 '14 at 17:38
  • To specify another separator, use -t: `tac temp.txt | sort -k1,1 -r -u -t@` – Simon Lang Apr 18 '17 at 20:56
  • Would you mind explaining the sort parameters `-k2,2`? :) – myradio Nov 03 '19 at 13:03
  • @myradio There is a good description on Wikipedia [here](https://en.wikipedia.org/wiki/Sort_(Unix)#Columns_or_fields) and [here](https://en.wikipedia.org/wiki/Sort_(Unix)#Sort_on_multiple_fields) – PaulP Nov 25 '19 at 06:17
11
awk '{a[$2]=$0} END {for (i in a) print a[i]}' file

If order of appearance is important:

  • Based on first appearance:

    awk '!a[$2] {b[++n]=$2} {a[$2]=$0} END {for (j=1; j<=n; j++) print a[b[j]]}' file
    
  • Based on last appearance:

    tac file | awk '!a[$2] {b[++n]=$2; a[$2]=$0} END {for (j=n; j>=1; j--) print a[b[j]]}'
    
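As the comments below note, the simplest variant can also just be piped through `sort`; the asker ended up sorting on the timestamp field afterwards. A minimal sketch, assuming the timestamp stays in the first field:

    awk '{a[$2]=$0} END {for (i in a) print a[i]}' file | sort -k1,1
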
pepoluan
  • This is good - simple and robust. The order of the output does not match the order of the input though, if that is important. Is there an easy way to fix that? – Paul Mar 25 '11 at 08:11
  • @Paul yes, but this will result in a much more complex awk program. I'll edit my answer. – pepoluan Mar 25 '11 at 08:12
  • Actually, I meant just reversing the printing of the array rather than changing which entry is selected, so that the output would be in time order: line 3, line 4, line 5 rather than line 5, line 4, line 3. +1 from me for the first simple answer. Oh wait, yeah - I see that is what you were doing - it does get stupidly complex. – Paul Mar 25 '11 at 08:24
  • @Paul oh, I misunderstood :) ... well, you can always pipe its output to `sort`; that would be much simpler than trying to cram everything into `awk`. – pepoluan Mar 25 '11 at 08:26
  • I used the simplest one and added a sort on the timestamp field after that. Really a good solution, thanks! – Dagang Mar 25 '11 at 10:19
6
sort < bar > foo
uniq  < foo > bar

bar now has no duplicated lines
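
For what it's worth, roughly the same effect can be had in one step, plus a variant that restricts uniqueness to the name field (as the comment below points out); a small sketch using standard sort options:

    sort -u bar > foo          # roughly equivalent to sort | uniq on whole lines
    sort -u -k2,2 bar > foo    # unique by the second field only; keeps one line per name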

nkvnkv
  • Given the OP's example, all the lines would be counted as unique. He only wants the second field to be used to determine uniqueness. – gdw2 Mar 01 '12 at 15:13
  • +1 ...but this answers the title ('bash eliminate duplicate lines' at the moment), which is what Google seemed to use to send me here! – sage Dec 27 '13 at 23:26
3

EDIT: Here's a version that actually answers the question.

tac filename | sort -s -k2,2 | while read f1 f2 f3; do if [ "$f2" != "$lf2" ]; then echo "$f1 $f2 $f3"; lf2="$f2"; fi; done
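
On the sample input this prints one line per name, ordered by the name field rather than by position in the original file:

    2011-03-22 name001 line5
    2011-03-22 name002 line4
    2011-03-21 name003 line3
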
Erik