
I would like to compare multiple columns from 2 files and NOT print lines matching my criteria. An example of this would be:

file1

apple  green  4
orange  red  5
apple  yellow 6
apple  yellow 8
grape  green 5

file2

apple  yellow 7
grape  green 10

output

apple  green  4
orange  red  5
apple  yellow 8

I want to remove lines where $1 and $2 from file1 match $1 and $2 from file2 AND $3 from file1 is smaller than $3 from file2. So far I can only do the first part, that is, remove lines where $1 and $2 from file1 match $1 and $2 from file2 (fields are separated by tabs):

awk -F '\t' 'FNR == NR {a[$1FS$2]=$1; next} !($1FS$2 in a)' file2 file1

Could you help me apply the last condition?

Many thanks in advance!

Agathe

2 Answers


Store the third field's value while building the array, then use it for the comparison:

$ awk -F '\t' 'FNR==NR{a[$1FS$2]=$3; next} !(($1FS$2 in a) && $3 < a[$1FS$2])' f2 f1
apple   green   4
orange  red 5
apple   yellow  8

Better written as:

awk -F '\t' '{k = $1FS$2} FNR==NR{a[k]=$3; next} !((k in a) && $3 < a[k])' f2 f1
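
To check a command of this shape against the sample data from the question, here is a minimal sketch. The temporary directory and the `result` variable are assumptions made for illustration, and the `+0` is an optional safeguard that forces a numeric comparison. A line from the second file is dropped only when its key was seen in the first file and its third field is the smaller one, which is the condition the question asks for.

```shell
#!/bin/sh
# Recreate the sample data from the question as tab-separated files.
tmp=$(mktemp -d)
printf 'apple\tgreen\t4\norange\tred\t5\napple\tyellow\t6\napple\tyellow\t8\ngrape\tgreen\t5\n' > "$tmp/f1"
printf 'apple\tyellow\t7\ngrape\tgreen\t10\n' > "$tmp/f2"

# First pass (f2): remember $3 under the key "$1<TAB>$2".
# Second pass (f1): skip a line only if its key is known AND its $3 is smaller.
result=$(awk -F '\t' '
    { k = $1 FS $2 }
    FNR == NR { a[k] = $3; next }
    !((k in a) && $3 + 0 < a[k] + 0)
' "$tmp/f2" "$tmp/f1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this data the rows `apple yellow 6` and `grape green 5` are dropped, because 6 < 7 and 5 < 10, while `apple yellow 8` is kept.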
Sundeep
  • The field separator is not needed, as `awk` splits on blanks (including tabs) by default – kvantour Sep 05 '18 at 15:09
  • true, just in case someone reading the answer in future has fields separated by tabs but fields can contain spaces ;) – Sundeep Sep 05 '18 at 15:30
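
The difference discussed in these two comments can be seen with a small sketch. The record below is made-up data (not from the question) whose first field contains a space; the variable names are likewise only illustrative.

```shell
#!/bin/sh
# One tab-separated record whose first field contains a space (made-up data).
line=$(printf 'red apple\tgreen\t4')

# Default splitting treats the space inside "red apple" as a separator too.
default_n=$(printf '%s\n' "$line" | awk '{ print NF }')
default_f1=$(printf '%s\n' "$line" | awk '{ print $1 }')

# With -F '\t' only tabs separate fields, so "red apple" stays in one piece.
tab_n=$(printf '%s\n' "$line" | awk -F '\t' '{ print NF }')
tab_f1=$(printf '%s\n' "$line" | awk -F '\t' '{ print $1 }')

printf 'default: NF=%s, $1=%s\n' "$default_n" "$default_f1"
printf 'tab:     NF=%s, $1=%s\n' "$tab_n" "$tab_f1"
```

Default splitting sees four fields here and `-F '\t'` sees three, so the key `$1 FS $2` would only be built from the intended columns in the second case.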

What you are after is this:

awk '(NR==FNR){a[$1,$2]=$3; next} !((($1,$2) in a) && $3 < a[$1,$2])' file2 file1
kvantour
  • When you're using the same hard-coded set of values as a key in multiple locations, seriously consider using a variable to hold the key instead: `'{key=$1 FS $2} (NR==FNR){a[key]=$3; next} !((key in a) && $3 < a[key])'`. Makes life easier if/when you need to change the key and IMHO improves clarity. – Ed Morton Sep 05 '18 at 21:27
  • It works but I had to remove the last two parentheses: – Agathe Sep 06 '18 at 11:03
  • @Agathe Which parenthesis? – kvantour Sep 06 '18 at 11:09
  • It works but I had to remove the last two parentheses: `awk '(NR==FNR){a[$1,$2]=$3; next}!(($1,$2) in a) && a[$1,$2] < $3' ` Also, can I ask for an explanation of the code? Especially, I do not understand the `a[key]=$3` and `a[key]<$3` parts. If I understand right, you are creating an array called `a` in which you store `$1FS$2` (=`key`) from file 2 and then you look for `$1FS$2` in file 1 to not print it. However, why does it seem you are assigning the array or comparing it to a column `$3`? – Agathe Sep 06 '18 at 11:21
  • @Agathe `a[key]` means the element of array `a` with index `key`. So `a[key]=$3` means to store the value of `$3` in the array element `a[key]`. The statement `key in a` is a query that checks whether array `a` has an element with index `key` – kvantour Sep 06 '18 at 12:05
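
The array operations explained in the last comment can be tried in isolation. The following is a small sketch with hard-coded values standing in for the fields; the names `key`, `other` and `out` are only illustrative.

```shell
#!/bin/sh
out=$(awk 'BEGIN {
    key = "apple" FS "yellow"       # build the index the same way as $1 FS $2
    a[key] = 7                      # store 7 in the element of a indexed by key

    other = "grape" FS "green"      # an index that was never stored
    print ((key in a)   ? "found" : "missing")
    print ((other in a) ? "found" : "missing")
    print a[key]                    # read the stored value back
    print ((6 < a[key]) ? "drop" : "keep")   # the kind of test applied to $3
}')
printf '%s\n' "$out"
```

The `in` test only asks whether an index exists and does not create it, whereas merely referencing `a[other]` would create an empty element. That is why the membership check is placed before the comparison.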