3

I'm using awk to merge multiple (>3) files, and I want to keep the headers. I found a previous post that does exactly what I need, but I don't quite understand what's happening. I was hoping someone could walk me through it so I can learn from it! (I tried commenting on the original post but did not have enough reputation)

This code

awk '{a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}END{for(i=1;i<=FNR;i++) print a[i]}' f*

transforms the input files as desired. See example tables below.

Input files:

file1.txt:

id    value1
a     10
b     30
c     50

file2.txt:

id    value2
a     90
b     30
c     20

file3.txt:

id    value3
a     0
b     1
c     25

desired output

merge.txt:

id    value1  value2  value3
a     10      90      0
b     30      30      1
c     50      20      25

Again, here's the code

awk '{a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}END{for(i=1;i<=FNR;i++) print a[i]}' f* > merge.txt

I'm having trouble understanding the first part of the code {a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}, but understand the loop in the second part of the code.

I think in the first part of the code, an array is being established. The code runs through and check for matching records on the first column id, and if there's a match then append the second column ($2) value and print the entire record ($0).

But...I don't understand the beginning syntax. When is it established that the first column id is the same across all three files and to only add the second column?

moxed
  • 343
  • 1
  • 6
  • 16
  • 1
    That code does not use or reference the `id`, it does everything based on the order of the lines not the value of the first field on each line. If you need to check the `id`s because they can vary across files then you need a different solution. – Ed Morton Apr 17 '17 at 17:22
  • 1
    Ah. Good to know. I do want the program to check the ids. It turns out that for the current files I'm working with the ids are the same, but in the future this won't always be the case. If I wanted to incorporate referencing the ids, what would be the best solution? Replace `FNR` in `a[FNR])?a[FNR]FS$2:$0)` with `$1`? -- NVM saw your code below. Thanks! – moxed Apr 17 '17 at 18:02

5 Answers5

4

First the data:

file1              file2              file3

NR FNR $1 $2       NR FNR $1 $2       NR FNR $1 $2
================   ================   ================
1  1   id value1   5  1   id value2   9  1   id value3
2  2   a  10       6  2   a  90       10 2   a  0
3  3   b  30       7  3   b  30       11 3   b  1
4  4   c  50       8  4   c  20       12 4   c  25

The first part: a[FNR]=( (a[FNR]) ? a[FNR]FS$2 : $0 ) could be written as:

if(a[FNR]=="")           # actually if(a[FNR]=="" || a[FNR]==0)
    a[FNR]=$0            # a[FNR] is "id value1" when NR==1
else
    a[FNR]=a[FNR] FS $2  # a[FNR]="id value1" FS "value2" when NR==5

Each file has 4 records, ie. FNR==4 on the last record of each file, especially the last file, since value of FNR remains after processing the last file:

END {                    # after hashing all record in all files
    for(i=1;i<=FNR;i++)  # i=1, 2, 3, 4
        print a[i]       # print "id value1 value value3" etc.
}
James Brown
  • 36,089
  • 7
  • 43
  • 59
4

That code is buggy and unnecessarily complicated, use this instead:

$ awk 'NR==FNR{a[FNR]=$0; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3
id    value1 value2 value3
a     10 90 0
b     30 30 1
c     50 20 25

pipe the output to column -t for alignment if you like:

$ awk 'NR==FNR{a[NR]=$0;next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3 | column -t
id  value1  value2  value3
a   10      90      0
b   30      30      1
c   50      20      25

If you NEED to key off ids (e.g. because they differ across the files) then it'd be:

$ awk '
    BEGIN { OFS="\t" }
    !($1 in a) { ids[++numIds]=$1 }
    { a[$1][ARGIND]=$2 }
    END {
        for (i=1;i<=numIds;i++) {
            id = ids[i]
            printf "%s%s", id, OFS
            for (j=1;j<=ARGIND;j++) {
                printf "%s%s", a[id][j], (j<ARGIND ? OFS : ORS)
            }
        }
    }
' file1 file2 file3 | column -s$'\t' -t
id  value1  value2  value3
a   10      90      0
b   30      30      1
c   50              25
x           20

That last script used GNU awk for multi-dimensional arrays and just had c changed to x in input file2 to test it.

Feel free to ask if you have questions but I THINK that code is pretty clear.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
2

James has explained pretty well the awk logic in his answer.

In case you're looking for an alternative here is a paste based solution:

paste file1 file2 file3 | awk '{print $1, $2, $4, $6}' OFS='\t'

id  value1  value2  value3
a   10  90  0
b   30  30  1
c   50  20  25
Community
  • 1
  • 1
anubhava
  • 761,203
  • 64
  • 569
  • 643
1

FNR is the number of records relative to the current input file. So the line number in file1, file2 etc. http://www.thegeekstuff.com/2010/01/8-powerful-awk-built-in-variables-fs-ofs-rs-ors-nr-nf-filename-fnr/?ref=binfind.com/web

The ? is the ternary operator and is saying, if there's already something in a[FNR] then append $2 of the current record to what's there, else it's null so store the whole record (i.e. $0).

Pseudo code that may help explain things:

if a[FNR] != ""
  a[FNR] = a[FNR] : FS : $2
else
  a[FNR] = $0

You can see that the a, b, c from every record after the first file is dropped - could be x, y, z and this program wouldn't care. It's taking the second field and appending to a[2], a[3] etc.

Ian McGowan
  • 3,461
  • 3
  • 18
  • 23
1

You can use awk with pr to do this:

$ pr -mts$'\t' f1 <(awk '{print $2}' f2) <(awk '{print $2}' f3) 
id    value1    value2  value3
a     10    90  0
b     30    30  1
c     50    20  25

(Those are tabs in between the columns)

Or use paste the same way:

$ paste f1 <(awk '{print $2}' f2) <(awk '{print $2}' f3)
id    value1    value2  value3
a     10    90  0
b     30    30  1
c     50    20  25
dawg
  • 98,345
  • 23
  • 131
  • 206