Shell - diff between files and parameters in each line

Question

I am looking for an advanced way to compare two files using shell/bash/PHP. Let's say I have these files:

file1

.run file=test_script.sql rev=1.1
.run file=test_sql.sql rev=1.1
.run file=test_drop.sql rev=1.2

file2

.run file=test_drop.sql rev=1.2
.run file=test_grant.sql rev=1.1
.run file=test_script.sql rev=1.2

I want to get the difference between those files (ignoring line order), that is:

.run file=test_grant.sql rev=1.1 #(because new line wasn't in file1 at all)
.run file=test_script.sql rev=1.2 #(because rev changed from rev=1.1 to rev=1.2)

But that is not all: I also want to check whether the same entry (.run file=name) was in the old file and, if it was, get its revision (rev=number). The final output should look like this:

file3:

 test_grant.sql 1.1 1.1
 test_script.sql 1.1 1.2

So far I have:

fgrep -x -v -f file1 file2

which gives:

.run file=test_grant.sql rev=1.1
.run file=test_script.sql rev=1.2

not much: fgrep -x -v -f file1 file2 gets `.run file=test_grant.sql rev=1.1` `.run file=test_script.sql rev=1.2` — Makaron, Jan 28 '16 at 17:15
Yes, they are unique; the same file= name can't appear twice in one file — Makaron, Jan 28 '16 at 17:21
At this point I don't need info on what was removed, only on what was newly added or changed. It would make the end output even more complex, because I will do further processing on the information collected. Unless it were saved to separate files: one file for what was added or changed, e.g. file_added_modified: `.run file=test_grant.sql rev=1.1 .run file=test_script.sql rev=1.2 `, and one file for what was removed, e.g. file_removed: `.run file=test_sql.sql rev=1.1` — Makaron, Jan 28 '16 at 17:35

Etan Reisner · Accepted Answer · 2016-01-28T18:38:45.207


This awk script should do what you want:

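awk '
# First file: store each rev= field under its file= key.
NR==FNR {
    map[$2]=$3
    next
}

# Second file: act on entries that are new or whose rev differs.
!map[$2] || (map[$2] != $3) {
    sub3=substr($3, index($3,"=")+1)              # new revision number
    subm2=substr(map[$2], index(map[$2],"=")+1)   # old revision number, empty if new
    print substr($2, index($2,"=")+1), subm2?subm2:sub3, sub3
}' file1 file2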

While looking at the first file (NR==FNR) store the rev field in the map array under the file key.

While looking at the second file (the second block), if the file field in this line isn't in the map array, or the current rev field doesn't match the stored rev field, then print the file name, the old revision number (or the new one when there is no old entry), and the new revision number.
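As a worked example, here is a minimal sketch that runs the same logic against the sample files from the question. The temporary directory, the `changed` variable, and the `val()` helper are illustrative names that are not in the original script, and the membership test uses `$2 in map` rather than `!map[$2]`, which should behave the same for this data.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory
# (the directory and variable names are illustrative assumptions).
tmpdir=$(mktemp -d)
printf '%s\n' \
    '.run file=test_script.sql rev=1.1' \
    '.run file=test_sql.sql rev=1.1' \
    '.run file=test_drop.sql rev=1.2' > "$tmpdir/file1"
printf '%s\n' \
    '.run file=test_drop.sql rev=1.2' \
    '.run file=test_grant.sql rev=1.1' \
    '.run file=test_script.sql rev=1.2' > "$tmpdir/file2"

changed=$(awk '
# val() is a hypothetical helper: return the text after the first "=".
function val(s) { return substr(s, index(s, "=") + 1) }

# First file: remember each rev= field under its file= key.
NR == FNR { map[$2] = $3; next }

# Second file: report entries that are new or whose rev differs.
!($2 in map) || map[$2] != $3 {
    old = ($2 in map) ? val(map[$2]) : val($3)
    print val($2), old, val($3)
}' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$changed"
rm -rf "$tmpdir"
```

With the sample data this prints `test_grant.sql 1.1 1.1` and `test_script.sql 1.1 1.2`, one per line, in the order the entries appear in file2.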

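To handle lines that were removed you would want to add {delete map[$2]} after the second block and then add END {for (file in map) {print "Missing: .run "file" "map[file]}} to the end. The loop variable holds the file= key and map[file] holds the matching rev= field, so this prints the missing line in its original field order.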
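Put together, the removed-line handling could look roughly like the sketch below. It assumes the sample files from the question; `tmpdir`, `report`, `val()` and the loop variable `key` are illustrative names, and the "Missing:" prefix is just the label suggested above.

```shell
#!/bin/sh
# Sample files from the question, written to a scratch directory
# (the directory and variable names are illustrative assumptions).
tmpdir=$(mktemp -d)
printf '%s\n' \
    '.run file=test_script.sql rev=1.1' \
    '.run file=test_sql.sql rev=1.1' \
    '.run file=test_drop.sql rev=1.2' > "$tmpdir/file1"
printf '%s\n' \
    '.run file=test_drop.sql rev=1.2' \
    '.run file=test_grant.sql rev=1.1' \
    '.run file=test_script.sql rev=1.2' > "$tmpdir/file2"

report=$(awk '
# val() is a hypothetical helper: return the text after the first "=".
function val(s) { return substr(s, index(s, "=") + 1) }

# First file: remember each rev= field under its file= key.
NR == FNR { map[$2] = $3; next }

# Second file: report new or changed entries.
!($2 in map) || map[$2] != $3 {
    old = ($2 in map) ? val(map[$2]) : val($3)
    print val($2), old, val($3)
}

# Second file: every entry seen here is not missing, so drop it.
{ delete map[$2] }

# Whatever is left was only in the first file.
END { for (key in map) print "Missing: .run " key " " map[key] }
' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$report"
rm -rf "$tmpdir"
```

For the sample data the last line of the report is `Missing: .run file=test_sql.sql rev=1.1`. With more than one removed entry, the order of the "Missing:" lines is unspecified, because `for (key in map)` does not guarantee an iteration order.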

edited Jan 28 '16 at 18:38

answered Jan 28 '16 at 17:36



Nice, although the diff report would imply something like `print $2, map[$2], map[$3]`, except deleting up to `=` in each value. Side note: the idiom `NR==FNR` is really accident-prone because it will match lines of the second file if the first file happens to be empty. `ARGIND==1` doesn't work either, because `ARGIND` counts non-file arguments like `V=1`. Personally, I prefer `ENDFILE{++files}` which lets you use `files` as a reasonably robust indicator. – rici Jan 28 '16 at 18:18
Maybe I'm reading it wrong. It was the place where OP says "So that the final output will look like this:..." – rici Jan 28 '16 at 18:32
@rici Yeah, I almost mentioned that `NR==FNR` has problems with empty files but didn't want to get into a whole discussion about how there isn't a non-GNU `awk` way to do it that is reliable. I haven't updated to think about `ENDFILE` yet since most of my `awk`ing is on CentOS 5 (awk 3.1.5) which doesn't have it but yes, that's a nice idiom. I hadn't noticed that about `ARGIND` that's annoying. – Etan Reisner Jan 28 '16 at 18:33
@rici Oh, wow, I totally skipped that block. Updating, thanks. – Etan Reisner Jan 28 '16 at 18:34
If you have ARGIND but not ENDFILE (both are gawk extensions, so there are no guarantees), you can still do it but it's ugly (you have to manually check to see if arguments match the var=value syntax). Without either, it's really annoying because checking against FILENAME fails when files are repeated, but it can still be accomplished using a combination of that and the FNR==NR test. CentOS-- imho. – rici Jan 28 '16 at 18:51
@rici CentOS 7 finally comes mostly into the modern era and includes gawk-4.0.2 (CentOS 6 is only awk 3.1.7). – Etan Reisner Jan 28 '16 at 19:09
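To illustrate the empty-file problem discussed in these comments, here is a small sketch using only POSIX awk features. The file names and the `naive` and `by_name` variables are made up for the demonstration. The `FILENAME == ARGV[1]` test is one possible portable workaround, not what the answer uses, and it has the limits mentioned above: it misfires when the same file is passed twice or when a var=value assignment comes before the file names.

```shell
#!/bin/sh
# An empty "old" file and a one-line "new" file (illustrative names).
tmpdir=$(mktemp -d)
: > "$tmpdir/empty"
printf '%s\n' '.run file=test_grant.sql rev=1.1' > "$tmpdir/new"

# NR==FNR: the empty first file contributes no records, so the lines
# of the second file still satisfy NR==FNR and are taken for the first.
naive=$(awk 'NR == FNR { print "first:", $2; next }
             { print "second:", $2 }' "$tmpdir/empty" "$tmpdir/new")

# Comparing FILENAME with the first argument does not depend on
# record counts, so the second file is recognised as the second.
by_name=$(awk 'FILENAME == ARGV[1] { print "first:", $2; next }
               { print "second:", $2 }' "$tmpdir/empty" "$tmpdir/new")

printf '%s\n' "$naive" "$by_name"
rm -rf "$tmpdir"
```

The first command reports the line as `first: file=test_grant.sql`, the second as `second: file=test_grant.sql`.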

Perfect, thanks! I also realised that I need a small change: if the `file=` entry wasn't listed in `file1`, then instead of taking its revision twice I need a hardcoded first entry of `1.1`, so I just added it there: `awk 'NR==FNR {map[$2]=$3; next;} !map[$2] || (map[$2] != $3) { sub3=substr($3, index($3,"=")+1); subm2=substr(map[$2], index(map[$2],"=")+1); print substr($2, index($2,"=")+1), subm2?subm2:'1.1',sub3}' file1 file2` – Makaron Jan 29 '16 at 07:45
So I figured out that I need the removed files after all. I came up with this; it prints `0` for those entries, so I will know that they were removed. I don't know if it's syntactically clean, but at least it works :) `awk 'NR==FNR {map[$2]=$3; next;} !map[$2] || (map[$2] != $3) { sub3=substr($3, index($3,"=")+1); subm2=substr(map[$2], index(map[$2],"=")+1); print substr($2, index($2,"=")+1), subm2?subm2:'1.1',sub3}{delete map[$2]}END {for (rev in map) {rmsubrev=substr(rev,index(rev,"=")+1); rmsubmrev=substr(map[rev],index(map[rev],"=")+1); print rmsubrev,'0',rmsubmrev;}}' file1 file2` @EtanReisner thanks again – Makaron Feb 03 '16 at 15:27
