One nearest neighbour using awk

Question

This is what I am trying to do using AWK language. I have a problem with mainly step 2. I have shown a sample dataset but the original dataset consists of 100 fields and 2000 records.

Algorithm

1) initialize accuracy = 0

2) for each record r

     Find the closest other record, o, in the dataset using distance formula

To find the nearest neighbour for r0, I need to compare r0 with r1 to r9 and do math as follows: square(abs(r0.c1 - r1.c1)) + square(abs(r0.c2 - r1.c2)) + ...+square(abs(r0.c5 - r1.c5)) and store those distance.

3) One with min distance, compare its c6 values. if c6 values are equal increment the accuracy by 1.

After repeating the process for all the records.

4) finally, Get the 1nn accuracy percentage by (accuracy/total_records) * 100;

Sample Dataset

        c1   c2   c3   c4   c5   c6  --> Columns
  r0  0.19 0.33 0.02 0.90 0.12 0.17  --> row1 & row7 nearest neighbour in c1
  r1  0.34 0.47 0.29 0.32 0.20 1.00      and same values in c6(0.3) so ++accuracy
  r2  0.37 0.72 0.34 0.60 0.29 0.15 
  r3  0.43 0.39 0.40 0.39 0.32 0.27 
  r4  0.27 0.41 0.08 0.19 0.10 0.18 
  r5  0.48 0.27 0.68 0.23 0.41 0.25 
  r6  0.52 0.68 0.40 0.75 0.75 0.35 
  r7  0.55 0.59 0.61 0.56 0.74 0.76 
  r8  0.04 0.14 0.03 0.24 0.27 0.37 
  r9  0.39 0.07 0.07 0.08 0.08 0.89

Code

BEGIN   {
            #initialize accuracy and total_records
            accuracy = 0;
            total_records = 10;
        }


NR==FNR {    # Loop through each record and store it in an array
                for (i=1; i<=NF; i++) 
                {
                     records[i]=$i;
                }
            next             
        }

        {   # Re-Loop through the file and compare each record from the array with each record in a file    
              for(i=1; i <= length(records); i++)
              {
                   for (j=1; j<=NF; j++) 
                   {      # here I need to get the difference of each field of the record[i] with each all the records, square them and sum it up. 
                          distance[j] = (records[i] - $j)^2;
                   }
               #Once I have all the distance, I can simply compare the values of field_6 for the record with least distance.
              if(min(distance[j]))
              {
                  if(records[$6] == $6)
                  {
                        ++accuracy;
                  } 
              }
       }
END{
     percentage = 100 * (accuracy/total_records); 
     print percentage;
}

you mean, fields[i] = print $i; and store all the fields in an array? though fields are independent but once the nearest row is found I will need to find the class_value which will be in field[6]. If I sort each field seperately I have jumbled up the data. Can you explain a bit more on how to go about with your idea? — Murlidhar Fichadia, Jun 18 '16 at 13:23
You incorporated the comments in your description, so now with your edits what is the actual question ? I.e. what difficulty do you have with step 2? — Soren, Jun 18 '16 at 16:58
I dont know how to apply formula for each fields of two records(including squaring and sum). this bit is wrong ->> distance[j] = (records[i] - $j); — Murlidhar Fichadia, Jun 18 '16 at 17:15

karakfa · Accepted Answer · 2016-06-18T22:15:17.143

0

Here is one approach

$ cat -n file > nfile
$ join nfile{,} -j99 | 
  awk 'function abs(x) {return x>0?x:-x}  
           $1<$8 {minc=999;for(i=2;i<7;i++) 
                 {d=abs($i-$(i+7)); 
                  if(d<minc)minc=d} 
                  print $1,minc,$7==$14}' | 
  sort -u -k1,2 -k3r | 
  awk '!a[$1]++{sum+=$3} END{print sum}'

7

due to symmetry you just need to compare n*(n-1)/2 records, easier to set it up with join to prepare all matches and filter out the redundant ones $1<$8, finds the min column distance per record and record the match of the last fields $7==$14, to find the minimum distance for each record sort by first record number and distance, finally get the sum of the matched entries.

Here for your formulation I guess the result will be 100*2*7/10=140% since you're double counting (R1~R7 and R7~R1), otherwise 70%

UPDATE
With the new distance function, the script can be re-written as

$ join nfile{,} -j999 | 
  awk '$1<$8 {d=0; 
              for(i=2;i<7;i++) d+=($i-$(i+7))^2; 
              print $1,d,$7==$14}' | 
  sort -k1,2n -k3r | 
  awk '!a[$1]++{sum+=$3;count++} 
            END{print 100*sum/(count+1)"%"}'

70%

Explanation

cat -n file > nfile create a new file with record numbers. join can't take both files from stdin, so you have to create a temporary file.

join nfile{,} -j999 cross product of records (each record will be joined with every record (similar effect of two nested loops)

$1<$8 will filter out the records to upper triangular section of the cross product (if you imagine it as a 2D matrix).

for(i=2;i<7;i++) d+=($i-$(i+7))^2; calculate the distance square of each record with respect to others

print $1,d,$7==$14 print from record, distance square, and indicator whether last fields match

sort -u -k1,2 -k3r find the min for each record, sort 3rd field reverse so that 1 will be first if there is any.

a[$1]++{sum+=$3;count++} count rows and sum the indicators for each from record

END{print 100*sum/(count+1)"%"} the number of fields is one more than from records, convert to percent formatting.

I suggest to understand what is going on run each piped section in stages and try to verify the intermediate results.

For your real data you have to change the hard coded reference values. Joined field should be more than your field count.

edited Jun 18 '16 at 22:15

answered Jun 18 '16 at 16:04

karakfa

66,216
7
41
56

my bad, please dont rely on the dataset, i madeup my own dataset as the actual dataset is quite huge. dataset wont be symmetry. the values will be anywhere between 0.0 to 1.0 its min-max normalized. And I will update my code as I think I was doing wrong previously. just give me a min, i am updating. and thanks for the response. – Murlidhar Fichadia Jun 18 '16 at 16:09
regardless of the values if R1 has min distance to R7 it implies R7 has min distance to R1. Therefore you need to compare records only to the other records with higher index. – karakfa Jun 18 '16 at 16:20
can you explain your code and how to run it? I dont understand join and the abs() function. I am fairly new to AWK. I get the part R1 min distance to R7 then vice-versa is same. but lets assume, I have 2000 records then for record = 10, i have to compare the records from 1 to 9 and then from record 11 to 2000. right? But I dont agree the fact that I need to look for only higher indexes. – Murlidhar Fichadia Jun 18 '16 at 16:32
Well, you changed the dist computation, so I have to rewrite. Note that with sum squares you don't need abs value. – karakfa Jun 18 '16 at 20:35
Yes, I am really sorry about it, but I myself didnt knew how to go with it in AWK language. the formula can be found here: https://en.wikipedia.org/wiki/Euclidean_distance its euclidean distance and about absolute thing you are right, 0.4-0.9 = (-0.5) which if I square I get +ve integer. so ya we can skip absolute calculation. – Murlidhar Fichadia Jun 18 '16 at 21:41
please dont rely on the distance formula I wrote in the Code, please refer my algorithm to implement the awk script, as it shows the logic and you could implement freely rather than following my coding style, I could be wrong, I have just laid down the pseudocode or blueprint. – Murlidhar Fichadia Jun 18 '16 at 21:53
Ok, i will play around with your script and let you know by today. – Murlidhar Fichadia Jun 19 '16 at 06:28
I am getting 0% for some reason for above data. I kept r1 and those 6 columns as you are looping from i=2 to i < 7, I assume you are ignoring column 1. I guess I am not running the code command correctly. I first executed this cat -n file.data > nfile.data and then ran rest of the script in terminal console. How do I execute your code? you have awk script twice and everything is connected using pipes, so if you could show steps on running the above code for the sample dataset. – Murlidhar Fichadia Jun 19 '16 at 09:35
cat -n file.data > file2.data and I get this 1 0.19 0.33 0.02 0.90 0.12 0.17 and rest of the records as it is? am I doing something wrong. – Murlidhar Fichadia Jun 19 '16 at 09:59
Can someone please explain $1<$8 even more? i dont really understand the usage of upper triangular. Why we need to do that? – Murlidhar Fichadia Jun 20 '16 at 10:23
1

For example, assume you have 3 records, how many comparisons do you need to evaluate? Cross product will give you 3x3=9, which contains redundancies, you don't need to compare the record by itself, so remove 3, also due to symmetry you need half of what remains, That is 3*2/2 = 3. What is the best way to get the filtered list? You want comparison between records with indices 1~2, 1~3, and 2~3. Here comes the condition (first index) < (second index). – karakfa Jun 20 '16 at 13:46
Oh, I see. Thanks :) – Murlidhar Fichadia Jun 20 '16 at 18:27

One nearest neighbour using awk

Algorithm

Sample Dataset

Code

1 Answers1

Linked