
Following my first question here, I want to extend the condition: find the closest value between two different files using the first and second columns, and print specific columns.

File 1

1 2 3 4 a1
1 4 5 6 b1
8 5 9 11 c1

File 2

1 1 3 a
1 2 5 b
1 2.1 4 c
1 4 6 d 
2 4 5 e
9 4 1 f 
9 5 2 g
9 6 2 h
11 10 14 i
11 15 5 j

So, for example, for each $1 in file 1 I need to find the closest $1 in file 2, and then also search for the closest $2.

Output:

1 2 a1*
1 2 b*
1 4 b1 
1 4 d 
8 5 c1 
9 5 g 

* First line from file 1 and second line from file 2: for the 1st column (of file 1) the closest value (in the 1st column of file 2) is 1, and the 2nd condition is that it must also be the closest value for the second column, which in this case is 2. I print $1,$2,$5 from file 1 and $1,$2,$4 from file 2.

The same procedure applies to the rest of the output.

The solution to find the closest value is in my other post and was given by @Tensibai, but any solution will work. Thanks!

  • thanks but maybe it is too soon for that =) – Nikko Apr 27 '15 at 17:11
  • So how big are those files? And may I ask why you insist on bash/awk? Real world use case? Of course it is possible in awk since it's Turing complete, but I don't see any advantage with using awk (just look at the solution of your other question, it's not one pass or anything), and using a more capable language like Python or Perl would make the program more readable. – 4ae1e1 Apr 27 '15 at 18:29
  • The only reason is just to keep consistency with my previous routines. And also considering that I have no idea about Python (not that I am an expert on bash or awk, but still). And the files are less than 2000 lines. – Nikko Apr 27 '15 at 19:41
  • (You'd better @ mention me or I won't see your comment unless I come back.) Okay, Python is very intuitive and easy to learn; if you have prior experience in other languages (especially in the C family), I bet you can pick up Python in two hours to hack up a script that solves your problem. And the result will be much more maintainable, considering how intuitive Python is and the percentage of developers who know Python (compared to awk, which even comes in different flavors). – 4ae1e1 Apr 27 '15 at 19:56
  • You can always call the Python interpreter from a Bash script anyway (but I would recommend Python scripts), and there won't be much performance hit with starting the Python interpreter unless you have a hoard of these files to process (in which case, believe it or not, you can also compile Python). – 4ae1e1 Apr 27 '15 at 19:56
  • @4ae1e1 ok man, then could you suggest a solution for this in Python? – Nikko Apr 27 '15 at 19:59
  • I mean, it's better for you to read a tutorial and roll your own; that way you'll learn a lot and won't need to ask for help for similar problems in the future. There's no complicated algorithm here right? Just basic flow control. I can write one for you of course, but what would you learn in that case? – 4ae1e1 Apr 27 '15 at 20:30
  • I don't get the last line of expected output, shouldn't it be `9 5 g` instead of `9 6 2 h`? – Tensibai Apr 27 '15 at 20:34
  • @4ae1e1 any text manipulation solution you write in Python can be written more clearly, briefly, portably (since awk exists in all UNIX installations), more maintainably, and with better performance in any POSIX awk, and even more so with GNU awk. awk is just stripped-down C optimized for text processing, with associative arrays and an implicit `while read line` loop, so it's trivial to learn. You can compile awk too, but the result won't have noticeably better performance, if any, than leaving it interpreted. – Ed Morton Apr 27 '15 at 22:28
  • @EdMorton I respect your opinion, but I won't really call this question "text manipulation". If this is "text manipulation", then any read input, process, and write output can be called text manipulation too (which is anything). – 4ae1e1 Apr 27 '15 at 22:30
  • @4ae1e1 this IS simply text manipulation. What is not text manipulation is when you start moving files around and/or starting/stopping processes and pipes and all that other stuff that shell is for and if you have that sort of problem then you should start looking at perl (or python?) instead of shell+awk. – Ed Morton Apr 27 '15 at 22:32
  • @EdMorton There's little moving files around or starting/stopping processes and pipes in this question. But this question involves a whole lot of global comparisons, and the only "text manipulation" here is dividing each line into columns, which can be trivially done in a lot of languages. I like shell scripts and standard Unix utilities a lot, but I don't see much advantage of them in this question, especially when the OP is not familiar enough with the utilities to write a script for himself. – 4ae1e1 Apr 27 '15 at 22:36
  • (As I said, of course you can call any reading from a file and writing to a file text manipulation. But obviously you ignored all the complication in between, and what's in between is called "programming".) Of course no one stops you from posting an awk solution. – 4ae1e1 Apr 27 '15 at 22:38
  • Right, there's no manipulating files/processes/pipes which is why I'm saying it's just text manipulation. It sounds like we have a different philosophy - mine is to do it in a standard UNIX utility (awk) unless there's an advantage to using some other tool whereas you seem to be suggesting using some other tool (python) because there's no advantage you can see to doing it in the standard UNIX utility. Doing many global comparisons or anything else is no more difficult in awk than python or any other language but I'm lazy and I don't understand the question well enough to suggest the solution. – Ed Morton Apr 27 '15 at 22:45
  • @EdMorton "It sounds like we have a different philosophy - mine is to do it in a standard UNIX utility (awk) unless there's an advantage to using some other tool." Well, I would use standard Unix utilities too when it's clearly sufficient, but suggesting a solution to one who can't do it either way and is asking for help is kind of different. Python (or Perl, which I also mentioned and then got ignored, probably due to reputation) is more portable, more capable in many cases (STL, PyPI, CPAN, etc), and no harder than awk to pick up, so that's why I'm suggesting it as something good to learn. – 4ae1e1 Apr 28 '15 at 00:35
  • @tensibai pff yes you are right I'll fix it now – Nikko Apr 28 '15 at 07:55
  • @4ae1e1 I really do think the language is not the problem here; you first need to be able to write the algorithm, and stepping from the algorithm to code is a matter of translation after that, with all the drawbacks of each language applying. Given the problem here, which is searching closest values in a multidimensional array, I'm unsure any language would be simpler than another. – Tensibai Apr 28 '15 at 08:41

1 Answer


Sounds a little convoluted but works:

function closest(array,searched) {
  distance=999999; # seed higher than any possible difference so the search always finds a key (otherwise it could return null)
  split(searched,skeys,OFS)
  # Get the first part of key
  for (x in array) { # loop over the array to get its keys
    split(x,mkeys,OFS) # split the array key
    tmp = (mkeys[1]+0 > skeys[1]+0) ? mkeys[1] - skeys[1] : skeys[1] - mkeys[1] # +0 forces a numeric comparison; compute the absolute difference between the key and the target
    if (tmp < distance) { # if the distance is less than the preceding one, update
      distance = tmp
      found1 = mkeys[1] # and save the key actually found closest
    }
  }
  # At this point we have the first part of key found, let's redo the work for the second part
  distance=999999;
  for (x in array) {
    split(x,mkeys,OFS)
    if (mkeys[1] == found1) { # Filter on the first part of key
      tmp = (mkeys[2]+0 > skeys[2]+0) ? mkeys[2] - skeys[2] : skeys[2] - mkeys[2] # +0 forces a numeric comparison; compute the absolute difference between the key and the target
      if (tmp < distance) { # if the distance is less than the preceding one, update
        distance = tmp
        found2 = mkeys[2] # and save the key actually found closest
      }
    }
  }
  # Now we got the second field, woot
  return (found1 OFS found2)  # return the combined key from our two searches
}

{
   if (NR>FNR) { # when NR exceeds FNR we are reading the second file (FNR restarts at 1 for each input file)

     b[($1 OFS $2)] = $4 # build an array with "$1 $2" as key and $4 as value
   } else {
     key = ($1 OFS $2) # Make the key to avoid too much computation accessing it later
     akeys[max++] = key # store the array keys to ensure order at end as for (x in array) does not guarantee the order
     a[key] = $5 # make an array with the key stored previously and $5 as value
   }

}

END { # Now we ended parsing the two files, print the result
  for (i=0; i<max; i++) { # loop over the stored keys by numeric index to keep the input order (a plain for (i in akeys) would not guarantee it)
    print akeys[i],a[akeys[i]] # print the value for the first array (key then value)
    if (akeys[i] in b) { # if the same key exist in second file
      print akeys[i],b[akeys[i]] # then print it
    } else {
      bindex = closest(b,akeys[i]) # call the function to find the closest key from second file
      print bindex,b[bindex] # print what we found
    }
  }
}

Note that I'm using OFS to combine the fields, so if you change it for the output the script will behave properly.
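For example (a quick sketch, not re-run here), overriding OFS on the command line should carry the new separator through both the composite keys and the printed result, with the sample files named f1 and f2 as in the run further below:

$ mawk -v OFS=';' -f closest2.awk f1 f2   # should print 1;2;a1, 1;2;b, ... instead of space-separated fields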

WARNING: This should do for relatively short files, but since the array from the second file is traversed twice, each search will take twice as long.

There's room for a better search algorithm if your files are sorted (but that was not the case in the previous question, and you wished to keep the order from the file). A first improvement in this case: break out of the for loop once the distance starts to grow, since past the closest key it can only keep growing. A sketch of that idea follows.
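For illustration only, here is a minimal sketch of that early-break variant of the first loop in closest(). It assumes a hypothetical array sorted[1..n] holding the first key parts in ascending numeric order (filled beforehand, e.g. with gawk's asort()); none of these names exist in the script above:

function closest_sorted1(sorted, n, target,   i, tmp, distance, found) {
  distance = -1 # seed from the data on the first iteration instead of an arbitrary 999999
  for (i=1; i<=n; i++) {
    tmp = (sorted[i]+0 > target+0) ? sorted[i] - target : target - sorted[i]
    if (distance < 0 || tmp < distance) { # still getting closer, keep this candidate
      distance = tmp
      found = sorted[i]
    } else
      break # the keys are sorted, so once the distance grows it can only keep growing
  }
  return found
}

This also picks up the suggestion from the comments below of seeding the minimum from the data rather than from a magic number.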

Output from your sample files:

$ mawk -f closest2.awk f1 f2
1 2 a1
1 2 b
1 4 b1
1 4 d
8 5 c1
9 5 g
  • Ping @EdMorton: I would love your insight on what could be improved there (given the keys are not sorted). – Tensibai Apr 28 '15 at 08:44
  • @Tensibai I've read the question a couple of times but still don't understand it given how much effort I'm willing to put into it, but in general: use gawk for true 2D arrays so you can do `a[key1][key2]` and then you don't need to create and later split a compound key. If you require compound keys you shouldn't use `OFS` as the subscript separator since it could be part of a field; use the default `SUBSEP` or `FS` instead. Also with gawk, you can set `PROCINFO["sorted_in"]` so when you do `for (key in array)` later it is traversed in the order you want. – Ed Morton Apr 28 '15 at 12:35
  • and for min/max calculations don't pick some arbitrary number you think will be out of range (`distance=999999`) but instead always seed with the first value, e.g. `distance=""; if ((tmp < distance) || (distance=="")) ...`. – Ed Morton Apr 28 '15 at 12:37
  • @EdMorton thanks for the advice; the use of OFS is deliberate here, given the input and the desired output. Does "sorted_in" make it possible to ensure the array is ordered by insertion time (i.e. the same order as the file lines; not sure I'm clear on this one)? – Tensibai Apr 28 '15 at 12:39
  • @Tensibai `sorted_in` tells gawk what order to access the array elements based on the array indices or contents and whether they should be treated numerically or as strings. See http://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Scanning. It also allows you to write your own function to decide the order. If you want to access an array `b[$1]=$2` in the order it was populated you'd use something like `a[++cnt]=$1; b[$1]=$2; ... for (i=1;i in a;i++) print b[a[i]]`. You might find `asort(a,b)` or `asorti(a,b)` useful for this too since they create a sorted array for you. – Ed Morton Apr 28 '15 at 12:43
  • @EdMorton Ok, so I'm right in my use of a numeric-based array to store the other keys in order. Side note on your last example: isn't `for (i in a)` enough? I don't understand the need for i=1 and i++ in this case. – Tensibai Apr 28 '15 at 12:49
  • By default `for (i in a)` will traverse `a` in the order in which it is stored in the internal hash table. You can consider that a random order for all practical purposes. If you want to traverse `a` in the specific numeric order of contiguous indices then in gawk you can do `PROCINFO["sorted_in"]="@ind_num_asc"; for (i in a)` or `for (i=1;i in a;i++)` in any awk or `for (i=1;i<=length(a);i++)` in gawk, or `for (i=1;i<=cnt;i++)` in any awk if you have a `cnt` variable that tells you how many entries `a` has. – Ed Morton Apr 28 '15 at 12:53
  • At another glance, and still without really understanding what the whole question is about, it looks to me like you should be able to use `asort()` at the start of the `END` section, outside of the loop that calls your `closest()` function, to create a sorted array, and then use binary search inside `closest()` to find the value closest to a key instead of walking the whole array. That should be significantly faster. There are other sorting algorithms that might speed things up too, see http://awk.info/?doc/sorting.html for the awk implementation of some of them. – Ed Morton Apr 28 '15 at 13:04 (a rough sketch of this idea follows these comments)
  • @EdMorton I stuck with it as the original question had a random-order index and specified keeping this order in the output; that's my note in the answer about it under the warning. But indeed it could be done anyway, alongside the 'work' arrays, to speed up the search. I'll see if I have time to improve this later. – Tensibai Apr 28 '15 at 13:11
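A rough awk sketch of the binary-search idea @EdMorton describes above, assuming a hypothetical array keys[1..n] of the first key parts sorted in ascending numeric order (e.g. produced by gawk's asort()); an illustration, not part of the answer's script:

function closest_binary(keys, n, target,   lo, hi, mid) {
  lo = 1; hi = n
  while (hi - lo > 1) { # shrink the window down to two neighbouring keys
    mid = int((lo + hi) / 2)
    if (keys[mid]+0 < target+0)
      lo = mid
    else
      hi = mid
  }
  # whichever neighbour is nearer to the target wins
  return (target - keys[lo] <= keys[hi] - target) ? keys[lo] : keys[hi]
}

Each lookup then costs O(log n) comparisons instead of two full passes over the array.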