5

I have these two files

File1:

9 8 6 8 5 2
2 1 7 0 6 1
3 2 3 4 4 6

File2: (which has over 4 million lines)

MN 1 0
JK 2 0
AL 3 90
CA 4 83
MK 5 54
HI 6 490

I want to compare field 6 of file1, and compare field 2 of file 2. If they match, then put field 3 of file2 at the end of file1 I've looked at other solutions but I can't get it to work correctly.

Desired output:

9 8 6 8 5 2 0
2 1 7 0 6 1 0
3 2 3 4 4 6 490

My attempt:

awk 'NR==FNR{a[$2]=$2;next}a[$6]{print $0,a[$6]}' file2 file1

program just hangs after that.

adrotter
  • 301
  • 3
  • 10
  • Your attempt does not "hang" for me. Although it does not produce the desired answer, it runs to completion. – John1024 Jul 30 '15 at 23:32
  • How long is `file1`? Also, based on the example input it looks like `file2` is sorted by field 2, is this the case? – Arkku Jul 31 '15 at 00:40

2 Answers2

6

To print all lines in file1 with match if available:

$ awk 'FNR==NR{a[$2]=$3;next;} {print $0,a[$6];}' file2 file1
9 8 6 8 5 2 0
2 1 7 0 6 1 0
3 2 3 4 4 6 490

To print only the lines that have a match:

$ awk 'NR==FNR{a[$2]=$3;next} $6 in a {print $0,a[$6]}' file2 file1
9 8 6 8 5 2 0
2 1 7 0 6 1 0
3 2 3 4 4 6 490

Note that I replaced a[$2]=$2 with a[$2]=$3 and changed the test a[$6] (which is false if the value is zero) to $6 in a.

John1024
  • 109,961
  • 14
  • 137
  • 171
  • 1
    Wonder why this has been downvoted – it produces the desired output with minimal changes to the OP's code. – Arkku Jul 31 '15 at 00:45
1

Your own attempt basically has two bugs as seen in @John1024's answer:

  1. You use field 2 as both key and value in a, where you should be storing field 3 as the value (since you want to keep it for later), i.e., it should be a[$2] = $3.
  2. The test a[$6] is false when the value in a is zero, even if it exists. The correct test is $6 in a.

Hence:

awk 'NR==FNR { a[$2]=$3; next } $6 in a {print $0, a[$6] }' file2 file1

However, there might be better approaches, but it is not clear from your specifications. For instance, you say that file2 has over 4 million lines, but it is unknown if there are also that many unique values for field 2. If yes, then a will also have that many entries in memory. And, you don't specify how long file1 is, or if its order must be preserved for output, or if every line (even without matches in file2) should be output.

If it is the case that file1 has many fewer lines than file2 has unique values for field 2, and only matching lines need to be output, and order does not need to be preserved, then you might wish to read file1 first…

Arkku
  • 41,011
  • 10
  • 62
  • 84
  • Hi Arkku. file2's fields are all unique. It's numbers 1 to 4 million. All lines in file1 will show in output because there will always be a match. – adrotter Aug 05 '15 at 04:59