Hi guys, I have two files, each with N columns and M rows.

File1

1 2 4 6 8
20 4 8 10 12
15 5 7 9 11

File2

1 a1 b1 c5 d1
2 a1 b2 c4 d2
3 a2 b3 c3 d3
19 a3 b4 c2 d4
14 a4 b5 c1 d5

What I need is to find, for each line of File1, the line of File2 whose value in column 1 is closest, and print specific columns in the output. For example, the output should be:

File3

1 2 4 6 8
1 a1 b1 c5 d1
20 4 8 10 12
19 a3 b4 c2 d4
15 5 7 9 11
14 a4 b5 c1 d5

Since 1 = 1, 19 is the closest to 20, and 14 is the closest to 15, the output consists of those lines. How can I do this in awk or any other tool?

Help!

This is what I have so far:

echo "ARGIND == 1 {
s1[\$1]=\$1;
s2[\$1]=\$2;
s3[\$1]=\$3;
s4[\$1]=\$4;
s5[\$1]=\$5;
}
ARGIND == 2 {
bestdiff=-1;
for (v in s1)
if (bestdiff < 0 || (v-\$1)**2 <= bestdiff) 
{
s11=s1[v];
s12=s2[v];
s13=s3[v];
s14=s4[v];
s15=s5[v];
bestdiff=(v-\$1)**2;
if (bestdiff < 2){
print \$0
print s11,s12,s13,s14,s15}}">diff.awk
awk -f diff.awk file2 file1

output:

1 2 4 6 8
1 a1 b1 c5 d1
20 4 8 10 12
19 a3 b4 c2 d4
15 5 7 9 1
14 a4 b5 c1 d5
1 2
1 1
14 15

I have no idea why the last three lines appear.

Nikko
  • Of course the two files need to be input. Since you tagged this with awk, you may have started coding something. Share it, please! – fedorqui Apr 23 '15 at 15:16
  • Yeah, I expressed myself wrongly. But I still have nothing to share. Any ideas? – Nikko Apr 23 '15 at 15:34
  • Is the number of lines the same in both files? What do you mean by "closest"? If we have only one line with value "30" in the first file and "40" in the second, is that close enough? – Andrey Sabitov Apr 23 '15 at 17:30
  • @AndreySabitov the number of lines is not the same, and yes, 30 is the closest to 40 if there isn't another value that is closer. – Nikko Apr 23 '15 at 20:01

1 Answer

Here is what I ended up with as a way to answer:

function closest(b, i,    x, tmp, distance, found) { # define a function; the extra parameters act as local variables
  distance = 999999 # this should be higher than the largest possible difference, otherwise nothing is found and an empty key is returned
  for (x in b) { # loop over the array to get its keys
    tmp = (x+0 > i+0) ? x - i : i - x # +0 forces a numeric comparison; tmp is the absolute difference between the key and the target
    if (tmp < distance) { # if the distance is less than the best so far, update
      distance = tmp
      found = x # and save the key found closest so far
    }
  }
  return found  # return the closest key
}

{ # parse the files for each line (no condition)
   if (NR>FNR) { # NR (total record number) exceeds FNR (record number in the current file) once we are past the first file, so fill the second array
     b[$1]=$0 # make an array with $1 as key
   } else {
     akeys[max++] = $1 # store the array keys to ensure order at end as for (x in array) does not guarantee the order
     a[$1]=$0 # make an array with $1 as key
   }
}

END { # Now we ended parsing the two files, print the result
  for (i = 0; i < max; i++) { # loop over the first file keys in input order (for (i in akeys) would not guarantee the order)
    print a[akeys[i]] # print the value for this file
    if (akeys[i] in b) { # if the same key exist in second file
      print b[akeys[i]] # then print it
    } else {
      bindex = closest(b,akeys[i]) # call the function to find the closest key from second file
      print b[bindex] # print what we found
    }
  }
}

I hope the comments make this clear enough; feel free to comment if needed.

Warning: this may become really slow if you have a large number of lines in the second file, as the whole second array is scanned for each key of the first file that is not present in the second file.

Given your sample inputs saved as a1 (File1) and a2 (File2):

$ mawk -f closest.awk a1 a2
1 2 4 6 8
1 a1 b1 c5 d1
20 4 8 10 12
19 a3 b4 c2 d4
15 5 7 9 11
14 a4 b5 c1 d5
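
If the second file can be sorted numerically on its first column, the full scan per key is not needed. Below is a minimal sketch of that idea, not a tested replacement for the script above: it binary-searches the sorted keys for each line of the first file. The function name `closest_sorted` is made up for this example, and it assumes the second file is non-empty, already sorted ascending on column 1 (for instance with `sort -n`), and that column 1 is numeric in both files. On a tie it picks the smaller key.

```shell
# closest_sorted FILE1 FILE2   (hypothetical helper, not part of the original answer)
# For each line of FILE1, print it followed by the line of FILE2 whose first
# column is numerically closest.
# ASSUMPTION: FILE2 is non-empty and sorted ascending on column 1.
closest_sorted() {
  awk '
    NR == FNR {                      # first file read by awk: the sorted lookup file
      key[++n] = $1 + 0              # numeric key
      line[n]  = $0                  # full line for printing later
      next
    }
    {
      print                          # line from FILE1
      t = $1 + 0
      lo = 1; hi = n
      while (lo < hi) {              # binary search: first index whose key is >= t
        mid = int((lo + hi) / 2)
        if (key[mid] < t) lo = mid + 1
        else hi = mid
      }
      # lo is now the first key >= t, or n when every key is smaller than t;
      # the left neighbour may be closer, so compare the two candidates
      if (lo > 1 && t - key[lo-1] <= key[lo] - t) lo--
      print line[lo]                 # closest line from FILE2
    }
  ' "$2" "$1"
}
```

With the sample inputs this could be used as `sort -n a2 > a2.sorted` followed by `closest_sorted a1 a2.sorted`. Each lookup then costs about log2(N) comparisons instead of N, at the price of requiring sorted input.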
Tensibai
  • Thank you very much, works perfectly! However, it is a bit too advanced for me to modify even slightly. How can I do it if I need to find the closest for a second column? Meaning that after the routine finds the closest in the first column, it then searches for the closest comparing the second column too. – Nikko Apr 27 '15 at 16:10
  • I don't get the point: how would you compare numeric values with text values? That said, there's a bug in my code: I print the `bindex` (first field from second file) twice (once as the index and once as part of the line). I'll edit to fix this. – Tensibai Apr 27 '15 at 16:13
  • You are right. Well, the thing is I will compare only numbers, of course. I was just searching for a general way to do it, but it got more complicated than I thought. Just imagine that all the columns are numbers. – Nikko Apr 27 '15 at 16:16
  • It would involve creating another array with the second field ($2) and passing it to the function... But without a use case I'm not sure I understand the goal. Maybe you can write a new question with your attempt and where it fails? – Tensibai Apr 27 '15 at 16:19
  • You may reference this question to give background on your first steps if needed. – Tensibai Apr 27 '15 at 16:21
  • done! http://stackoverflow.com/questions/29901881/find-the-closest-values-multiple-columns-conditions – Nikko Apr 27 '15 at 17:06
  • Sorry to revive this old question, but I need some improvement in speed. Is there any way to restrict the search to the closest 200 rows instead of the whole array? (The columns are sorted, and the values in the different files should be similar, so I know that the closest value will be within 10 or 20 rows; just to be sure I'd like to extend it to the closest 200, maybe 100 up and 100 down, or something like that.) – Nikko Feb 07 '16 at 21:28
  • @Nikko I'm at home and about to go to bed; can you craft a question with a definition and information about the input size? Maybe there's better tooling than awk. Anyway, there's a better algorithm if entries are sorted (no need to parse the full second array, so we should be able to loop only within the range -100/+100). – Tensibai Feb 07 '16 at 21:50
  • OK I will! I'll give you the link tomorrow thanks again! – Nikko Feb 07 '16 at 22:33
  • done! http://stackoverflow.com/questions/35270899/closest-value-in-the-100-rows-different-files-bash-awk-other – Nikko Feb 08 '16 at 13:32