Compare one field, Remove duplicate if value of another field is greater

Question

Trying to do this at linux command line. Wanting to combine two files, compare values based on ID, but only keeping the ID that has the newer/greater value for Date (edit: equal to or greater than). Because the ID 456604 is in both files, wanting to only keep the one from File 2 with the newer date: "20111015 456604 tgf"

File 1

Date     ID     Note
20101009 456604 abc
20101009 444444 abc
20101009 555555 abc
20101009 666666 xyz

File 2

Date     ID     Note
20111015 111111 abc
20111015 222222 abc
20111015 333333 xyz
20111015 456604 tgf

And then the output to have both files combined, but only keeping the second ID value, with the newer date. The order of the rows are in does not matter, just example of the output for concept.

Output

Date     ID     Note
20101009 444444 abc
20101009 555555 abc
20101009 666666 xyz
20111015 111111 abc
20111015 222222 abc
20111015 333333 xyz
20111015 456604 tgf

jas · Accepted Answer · 2014-12-30T09:11:02.197

2

$ cat file1.txt file2.txt | sort -ru | awk '!($2 in seen) { print; seen[$2] }'

Date     ID     Note
20111015 456604 tgf
20111015 333333 xyz
20111015 222222 abc
20111015 111111 abc
20101009 666666 xyz
20101009 555555 abc
20101009 444444 abc

Sort the combined files by descending date and only print a line the first time you see an ID.

EDIT

More compact edition, thanks to Steve:

cat file1.txt file2.txt | sort -ru | awk '!seen[$2]++'

edited Dec 30 '14 at 09:11

answered Dec 30 '14 at 01:27

jas

10,715
2
30
41

The less AWKward way: `... | awk '!seen[$2]++'` – Steve Dec 30 '14 at 02:26
The first one worked perfectly and printed out the rest of the fields. Was using sed -i previously to match and remove using a for loop, but this is much faster since it's not requiring to rewrite the file each time. The second one only printed out the IDs. – 4141n Dec 31 '14 at 16:15

score 0 · Answer 2 · answered Dec 30 '14 at 01:10

You didn't specify how you'd like to handle the case were the dates are also duplicated, or even if this case could exist. Therefore, I have assumed that by 'greater', you really mean 'greater or equal to' (it also makes handling the header a tiny bit easier). If that's not the case, please edit your question.

awk code:

awk 'FNR==NR {
    a[$2]=$1
    b[$2]=$0
    next
}

a[$2] >= $1 {
    print b[$2]
    delete b[$2]
    next
}

1

END {
    for (i in b) {
        print b[i]
    }
}' file2 file1

Explanation:

Basically, we use an associative array, called a, to store the 'ID' and 'Date' as key and value, respectively. We also store the contents of file2 in memory using another associative array called b. When file1 is read, we test if column two exists in our array, a, and that the key's value is greater or equal to column one. If it is, we print the corresponding line from array b, then delete it from the array, and next onto the next line/record of input. The 1 on it's lonesome will return true, thereby enabling printing where the previous (two) conditions are not met. This has the effect of printing any unmatched records from file1. Finally, we print what's left in array b.

Results:

Date     ID     Note
20111015 456604 tgf
20101009 444444 abc
20101009 555555 abc
20101009 666666 xyz
20111015 222222 abc
20111015 111111 abc
20111015 333333 xyz

You could just set `b[$2]=$0` in the block for after `a[$2] >= $1` and then you wouldn't need to delete anything. — , Dec 30 '14 at 03:14
@Jidder: I see what you're trying to do, but unless I'm mistaken, that won't have the desired effect. This is because I read `file2` first, then `file1`. — Steve, Dec 30 '14 at 04:11
yep, you're right, my bad. Won't that mean if the date is higher in the second file it will be printed twice though ? Not sure if OP would care about that though as they didn't specify. — , Dec 30 '14 at 04:21
@Jidder: Correct; and some code would need to be added to handle that situation. Though, given the sample input, I think that looks unlikely. OP to clarify. — Steve, Dec 30 '14 at 14:40

score 0 · Answer 3 · 2014-12-30T03:23:42.603

Another awk way

awk 'NR==1;FNR>1{a[$2]=(a[$2]<$1&&b[$2]=$3)?$1:a[$2]}
     END{for(i in a)print a[i],i,b[i]}' file file2

Compares value in an array to previously stored value to determine which is higher, also stores the third field if current record is higher.
Then prints out the stored date,key(field 2) and the value stored for field 3.

Or shorter

awk 'NR==1;FNR>1{(a[$2]<$1&&b[$2]=$0)&&a[$2]=$1}END{for(i in b)print b[i]}' file file2

Compare one field, Remove duplicate if value of another field is greater

3 Answers3