Compare two columns of two files and count the differences

Question

I have two tab-separated files and want to compare them line by line: column 1 of file1 against column 1 of file2, column 2 against column 2, and so on for all n columns.

The goal is to count the differences per column.

Values in columns can be either 0, 1 or 2, for example:

File1:

col1 col2 col3 col4
1 1 1 2
1 1 1 2
2 1 2 2
2 1 2 2

File2:
col1 col2 col3 col4
1 1 1 1
1 1 0 1
0 1 0 1
1 0 1 0

Results
2 1 3 4

So col1 of file1 and file2 has 2 differences, col2 of file1 and file2 has 1 difference, and so forth. I have seen many similar AWK questions, but most of them compare columns and append a column from either file depending on whether the values match; they do not count differences.

I believe detecting a mismatch between two columns would start with something like this, but from there I am totally lost:

awk 'NR==FNR { a[$1]!=$1; next}

Thanks

`col1 of file1 and file2 with 2 differences`: Are you sure about this? It is same in both files i.e. `1 1 2 2` values? — anubhava, Jul 27 '21 at 14:45

anubhava · Accepted Answer · 2021-07-27T17:36:00.357

You may use this awk:

awk 'BEGIN{FS=OFS="\t"} FNR == NR {for (i=1; i<=NF; ++i) a[i,FNR] = $i; next} FNR > 1 {for (i=1; i<=NF; ++i) if ($i != a[i,FNR]) ++out[i]; ncol=NF} END {print "Results"; for (i=1; i <= ncol; ++i) printf "%s%s", out[i]+0, (i < ncol ? OFS : ORS)}' f2 f1

Results
2   1   3   4

A more readable form:

awk 'BEGIN {FS=OFS="\t"}
FNR == NR {
   for (i=1; i<=NF; ++i)
      a[i,FNR] = $i
   next
}
FNR > 1 {
   for (i=1; i<=NF; ++i)
      if ($i != a[i,FNR])
         ++out[i]
   ncol = NF
}
END {
   print "Results"
   for (i=1; i <= ncol; ++i)
      printf "%s%s", out[i]+0, (i < ncol ? OFS : ORS)
}' f2 f1
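
To try the command against the question's sample data, the files can be recreated with printf. This is only a sketch: the temporary directory and the count_diffs wrapper are assumptions made for illustration, and the "Results" header line is left out so that only the counts are printed.

```shell
#!/bin/sh
# Recreate the sample data as tab-separated files in a scratch directory.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
f1=$dir/f1
f2=$dir/f2
printf '%s\t%s\t%s\t%s\n' col1 col2 col3 col4  1 1 1 2  1 1 1 2  2 1 2 2  2 1 2 2 > "$f1"
printf '%s\t%s\t%s\t%s\n' col1 col2 col3 col4  1 1 1 1  1 1 0 1  0 1 0 1  1 0 1 0 > "$f2"

# count_diffs is a hypothetical wrapper around the awk program from this answer.
count_diffs() {
    awk 'BEGIN {FS=OFS="\t"}
    FNR == NR { for (i=1; i<=NF; ++i) a[i,FNR] = $i; next }           # remember the first file
    FNR > 1   { for (i=1; i<=NF; ++i) if ($i != a[i,FNR]) ++out[i]    # count mismatches per column
                ncol = NF }
    END       { for (i=1; i<=ncol; ++i) printf "%s%s", out[i]+0, (i < ncol ? OFS : ORS) }' "$1" "$2"
}

count_diffs "$f2" "$f1"    # prints the four counts separated by tabs
```

Because the comparison is symmetric, passing the files in either order gives the same counts.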

IceCreamToucan · Answer 2 · score 3 · edited Jul 27 '21 at 17:06

If you have paste available, you can do this without storing anything in an array except the output counts:

paste File1 File2 |
awk '
    NR > 1 {
        mid = NF/2
        for (i=1; i<=mid; i++) {
            count[i] += ( $i == $(mid+i) ? 0 : 1 )
        }
    }
    END {
        for (i=1; i<=mid; i++) {
            printf "%d%s", count[i], (i<mid ? OFS : ORS)
        }
    }
'

Output:

2 1 3 4
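
To see why the script compares $i with $(mid+i), it can help to look at one pasted line. The sketch below assumes the question's sample data; show_pairs is a hypothetical helper, not part of the answer.

```shell
#!/bin/sh
# Recreate the sample data in a scratch directory.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
File1=$dir/File1
File2=$dir/File2
printf '%s\t%s\t%s\t%s\n' col1 col2 col3 col4  1 1 1 2  1 1 1 2  2 1 2 2  2 1 2 2 > "$File1"
printf '%s\t%s\t%s\t%s\n' col1 col2 col3 col4  1 1 1 1  1 1 0 1  0 1 0 1  1 0 1 0 > "$File2"

# show_pairs FILE_A FILE_B ROW: print the field pairs that get compared on pasted line ROW.
show_pairs() {
    paste "$1" "$2" |
    awk -v row="$3" '
    NR == row {
        mid = NF / 2                  # first half of the fields is from FILE_A, second half from FILE_B
        for (i = 1; i <= mid; i++)
            printf "col%d: %s vs %s\n", i, $i, $(mid + i)
    }'
}

show_pairs "$File1" "$File2" 3        # line 3 is the second data row
```

With the sample data each pasted line has 8 fields, so mid is 4 and field i is paired with field i+4. This relies on both files having the same number of columns on every line.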

edited Jul 27 '21 at 17:06 by Ed Morton
answered Jul 27 '21 at 15:15 by IceCreamToucan

Renaud Pacalet · Answer 3 · score 1 · 2021-07-27T15:25:35.340

With getline:

$ cat foo.awk
NR == 1 { n = NF; }
{
  if(NF != n) { print "error"; exit 1; }
  for(i = 1; i <= n; i++) a[i] = $i;
  if((getline < f) != 1 || NF != n) { print "error"; exit 1; }
  for(i = 1; i <= NF; i++) if(a[i] != $i) c[i] += 1;
}
END {
  for(i = 1; i <= n; i++) printf("%d%c", c[i], (i == n) ? "\n" : " ");
}

$ awk -v f=File1 -f foo.awk File2
2 1 3 4

Explanation:

Variable f holds the name of the first file; we pass it to awk with the -v f=File1 option and pass the second file name (File2) to awk as the file to process.
We set n (number of fields) from the first line of the second file. Later, if we encounter a line with a different number of fields in either of the two files, we exit with an error message.
We fill array a with the fields from the current line.
Then we read the next line from the first file with getline, which replaces the current fields with the new values. We exit with an error message if getline fails.
We compare the fields with array a and increment elements of array c if a difference is found.
At the end we print array c.

Note: some awk experts advise against getline. If you would rather avoid it too, use one of the solutions that pass File1 and File2 to awk and store the content of the first one in an array. But if your files are large, remember that you could run into memory issues, while the getline-based solution could process billions of lines of hundreds of fields without any problem (but would you use awk in this case?).

answered Jul 27 '21 at 14:57 · edited Jul 27 '21 at 15:25

It's good that you provide a warning about `getline` when you use it, it'd be even more useful if you referenced http://awk.freeshell.org/AllAboutGetline which explains the issues. – Ed Morton Jul 27 '21 at 16:56
The statement `getline < f != 1` is undefined behavior. Any given awk could interpret that as `(getline < f) != 1` or `getline < (f != 1)` or something else. Any time you have an expression (e.g. `f != 1`) on the right side of input or output redirection it must be parenthesized to be portable to all awks. – Ed Morton Jul 27 '21 at 17:20
`for(i = 1; i <= n; i++) a[i] = $i` = `split($0,a)` – Ed Morton Jul 27 '21 at 17:21
`if($i &&` would fail if `$i` was `0` and I don't think it's necessary since you already checked `NF != n`. – Ed Morton Jul 27 '21 at 17:24
It's better to print `ORS` than `"\n"` because the latter is hard-coding a value that you hope/assume that `ORS` has but would cause the script to fail if you set `-v ORS='\r\n'` or something. – Ed Morton Jul 27 '21 at 17:25
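
Folding the comments above into the getline approach gives roughly the following variant. It is a sketch rather than the answer author's own revision: the count_diffs_getline wrapper, the err flag and the temporary files are assumptions added for illustration.

```shell
#!/bin/sh
# Recreate the sample data in a scratch directory.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
File1=$dir/File1
File2=$dir/File2
printf '%s\t%s\t%s\t%s\n' col1 col2 col3 col4  1 1 1 2  1 1 1 2  2 1 2 2  2 1 2 2 > "$File1"
printf '%s\t%s\t%s\t%s\n' col1 col2 col3 col4  1 1 1 1  1 1 0 1  0 1 0 1  1 0 1 0 > "$File2"

# count_diffs_getline FILE_A FILE_B: FILE_A is read with getline, FILE_B is the main input.
count_diffs_getline() {
    awk -v f="$1" '
    NR == 1 { n = NF }
    {
        if (NF != n) { print "error"; err = 1; exit 1 }
        split($0, a)                              # copy the current fields into a[]
        if ((getline < f) != 1 || NF != n) {      # parenthesized so every awk parses it the same way
            print "error"; err = 1; exit 1
        }
        for (i = 1; i <= n; i++)
            if (a[i] != $i) c[i]++                # no "$i &&" guard, so 0 values are compared too
    }
    END {
        if (err) exit 1                           # exit in a rule still runs END, so skip the summary
        for (i = 1; i <= n; i++)
            printf "%d%s", c[i], (i == n ? ORS : OFS)
    }' "$2"
}

count_diffs_getline "$File1" "$File2"
```

Only one line from each file is held in memory at a time, which is the property the note in the answer relies on for large inputs.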

James Brown · Answer 4 · 2021-07-27T16:41:41.897

As the values in the fields are single characters (0, 1, 2), we skip the headers, pack each column's values into a string indexed by field number with no delimiters (for example a[1]="1122"), and use substr() to extract the character to compare ($i!=substr(a[i],FNR-1,1)):

awk '
NR==FNR && NR>1 {                         # process first file, ignore header
    for(i=1;i<=NF;i++)                    # since column values are 1 digit only
        a[i]=a[i] $i                      # just catenate them, no separators
    next
}
FNR>1 {                                   # process second file
    for(i=1;i<=NF;i++)
        r[i]+=($i!=substr(a[i],FNR-1,1))  # compare field data and count mismatches
}
END {                                     # in the end
    for(i=1;(i in r);i++)                 # loop and ...
        printf "%s%s",(i==1?"":OFS),r[i]  # output
    print ""
}' file1 file2

Output:

2 1 3 4

Notice: This only works for single char values, as requested in the OP.
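
To make the packing step concrete, the sketch below prints the strings built from the first file of the question's sample data. show_packed is a hypothetical helper that covers only the first half of the script above.

```shell
#!/bin/sh
# Recreate the first sample file in a scratch directory.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
file1=$dir/file1
printf '%s\t%s\t%s\t%s\n' col1 col2 col3 col4  1 1 1 2  1 1 1 2  2 1 2 2  2 1 2 2 > "$file1"

# show_packed FILE: print one packed string per column.
show_packed() {
    awk '
    NR > 1 {                              # skip the header
        for (i = 1; i <= NF; i++)
            a[i] = a[i] $i                # append the value of this row to the string for column i
    }
    END {
        for (i = 1; (i in a); i++)
            printf "a[%d]=%s\n", i, a[i]
    }' "$1"
}

show_packed "$file1"
```

Character k of each string is the value from data row k. In the second file that row has FNR equal to k+1 because of the header, which is why the answer reads it back with substr(a[i],FNR-1,1). A two-character value such as 10 would shift every later position, hence the single-character restriction.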
