0

Hello Guys I need a help in building an awk command which can simulate full outer join and then compare values

Say

cat File1
1|A|B
2|C|D
3|E|F

cat File2
1|A|X
2|C|D
3|Z|F

Assumptions

  1. first column in both the files is the key field so no duplicates
  2. both the files are expected to have same structure
  3. No limit on the number of fields

Now, If I run the awk command

awk -F'|' ........... File1 File2 > output

Output format

<Key>|<File1.column1>|<File2.column1>|<Matched/Mismatched>|<File1.column2>|<File2.column2>|<Matched/Mismatched>|<File1.column3>|<File2.column3>|<Matched/Mismatched>

cat output
1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED

Thank You

Mark Setchell
  • 191,897
  • 31
  • 273
  • 432
Diwakar
  • 1
  • 1

3 Answers3

1
$ awk -v OFS=\| -F\| 'NR==FNR{for(i=2;i<=NF;i++)a[$1][i]=$i;next}{printf "%s",$1;for(i=2;i<=NF;i++){printf"%s|%s|%s",a[$1][i],$i,a[$1][i]==$i?"matched":"mismatched"}printf"\n"}' file1 file2
1|A|A|matched|B|X|mismatched
2|C|C|matched|D|D|matched
3|E|Z|mismatched|F|F|matched

BEGIN {
  OFS="|"; FS="|"
}
NR==FNR {                # for the first file
  for(i=2;i<=NF;i++)     # fill array with "non-key" fields
    a[$1][i]=$i;next     # and use the "key" field as an index
}
{
  printf "%s",$1        
  for(i=2;i<=NF;i++) {   # use the key field to match and print
    printf"|%s|%s|%s",a[$1][i],$i,a[$1][i]==$i?"matched":"mismatched"
  }
  printf"\n"             # sugar on the top
}
James Brown
  • 36,089
  • 7
  • 43
  • 59
  • 1
    +1 for the right approach. You should mention that is GNU awk specific for true multi-dimensional arrays. wrt `a[$1][i]=$i;next` IMHO it's extremely misleading syntax to put the `next` on the same line as the assignment as that makes it look like `next` gets executed when the assignment does within the loop when in fact it gets executed after the loop terminates. `print ""` is always better than hardcoding `printf "\n"` as it will work correctly for any `RS` value. Unparenthesized ternary expressions cause syntax errors in some awks in some contexts. Finally - white space helps readability. – Ed Morton Aug 03 '16 at 16:14
  • 1
    Thanks for your tips. I just that missed `next` and the rest is due to inexperiece. – James Brown Aug 03 '16 at 17:44
0

perhaps easier with join assist

$ join -t'|' file1 file2 | 
  awk -F'|' -v OFS='|' '{n="MIS"; m="MATCHED"; 
                         m1=($2!=$4?n:"")m; 
                         m2=($3!=$5?n:"")m; 
                         print $1,$2,$4,m1,$3,$5,m2}'

1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED

for unspecified number of fields need more awk

$ join -t'|' file1 file2 | 
  awk -F'|' '{c=NF/2; printf "%s", $1; 
              for(i=2;i<=c+1;i++) printf "|%s|%s|%s", $i,$(i+c),($i!=$(i+c)?"MIS":"")"MATCHED"; 
              print ""}'
karakfa
  • 66,216
  • 7
  • 41
  • 56
0
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==FNR {
    for (i=2; i<=NF; i++) {
        a[$1,i] = $i
    }
    next
}
{
    printf "%s%s", $1, OFS
    for (i=2; i<=NF; i++) {
        printf "%s%s%s%s%s%s", a[$1,i], OFS, $i, OFS, (a[$1,i]==$i ? "" : "MIS") "MATCHED", (i<NF ? OFS : ORS)
    }
}

$ awk -f tst.awk file1 file2
1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED
Ed Morton
  • 188,023
  • 17
  • 78
  • 185