AWK command to simulate full outer join and then compare

Question

Hello Guys I need a help in building an awk command which can simulate full outer join and then compare values

Say

cat File1
1|A|B
2|C|D
3|E|F

cat File2
1|A|X
2|C|D
3|Z|F

Assumptions

first column in both the files is the key field so no duplicates
both the files are expected to have same structure
No limit on the number of fields

Now, If I run the awk command

awk -F'|' ........... File1 File2 > output

Output format

<Key>|<File1.column1>|<File2.column1>|<Matched/Mismatched>|<File1.column2>|<File2.column2>|<Matched/Mismatched>|<File1.column3>|<File2.column3>|<Matched/Mismatched>

cat output
1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED

Thank You

James Brown · Answer 1 · 2016-08-03T11:35:03.830

1

$ awk -v OFS=\| -F\| 'NR==FNR{for(i=2;i<=NF;i++)a[$1][i]=$i;next}{printf "%s",$1;for(i=2;i<=NF;i++){printf"%s|%s|%s",a[$1][i],$i,a[$1][i]==$i?"matched":"mismatched"}printf"\n"}' file1 file2
1|A|A|matched|B|X|mismatched
2|C|C|matched|D|D|matched
3|E|Z|mismatched|F|F|matched

BEGIN {
  OFS="|"; FS="|"
}
NR==FNR {                # for the first file
  for(i=2;i<=NF;i++)     # fill array with "non-key" fields
    a[$1][i]=$i;next     # and use the "key" field as an index
}
{
  printf "%s",$1        
  for(i=2;i<=NF;i++) {   # use the key field to match and print
    printf"|%s|%s|%s",a[$1][i],$i,a[$1][i]==$i?"matched":"mismatched"
  }
  printf"\n"             # sugar on the top
}

edited Aug 03 '16 at 11:35

answered Aug 03 '16 at 11:24

James Brown

36,089
7
43
59

1

+1 for the right approach. You should mention that is GNU awk specific for true multi-dimensional arrays. wrt `a[$1][i]=$i;next` IMHO it's extremely misleading syntax to put the `next` on the same line as the assignment as that makes it look like `next` gets executed when the assignment does within the loop when in fact it gets executed after the loop terminates. `print ""` is always better than hardcoding `printf "\n"` as it will work correctly for any `RS` value. Unparenthesized ternary expressions cause syntax errors in some awks in some contexts. Finally - white space helps readability. – Ed Morton Aug 03 '16 at 16:14
1

Thanks for your tips. I just that missed `next` and the rest is due to inexperiece. – James Brown Aug 03 '16 at 17:44

karakfa · Answer 2 · 2016-08-03T14:24:57.637

perhaps easier with join assist

$ join -t'|' file1 file2 | 
  awk -F'|' -v OFS='|' '{n="MIS"; m="MATCHED"; 
                         m1=($2!=$4?n:"")m; 
                         m2=($3!=$5?n:"")m; 
                         print $1,$2,$4,m1,$3,$5,m2}'

1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED

for unspecified number of fields need more awk

$ join -t'|' file1 file2 | 
  awk -F'|' '{c=NF/2; printf "%s", $1; 
              for(i=2;i<=c+1;i++) printf "|%s|%s|%s", $i,$(i+c),($i!=$(i+c)?"MIS":"")"MATCHED"; 
              print ""}'

score 0 · Answer 3 · answered Aug 03 '16 at 16:03

$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==FNR {
    for (i=2; i<=NF; i++) {
        a[$1,i] = $i
    }
    next
}
{
    printf "%s%s", $1, OFS
    for (i=2; i<=NF; i++) {
        printf "%s%s%s%s%s%s", a[$1,i], OFS, $i, OFS, (a[$1,i]==$i ? "" : "MIS") "MATCHED", (i<NF ? OFS : ORS)
    }
}

$ awk -f tst.awk file1 file2
1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED

AWK command to simulate full outer join and then compare

Assumptions

3 Answers3