I have a dataset of 1994 records with 13 fields each. I am trying to compute the cross product of the dataset with itself:

Dataset

c1  c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13
1    2  5  6  7  3  1  8  5  9   7   3   4
2    4  .  .  .  .  .  .  .  .   .   .   .
3    9  .  .  .  .  .  .  .  .   .   .   .
.    .  .  .  .  .  .  .  .  .   .   .   .
.    .  .  .  .  .  .  .  .  .   .   .   .
1994 .  .  .  .  .  .  .  .  .   .   .   .

The output of the cross product should place each record side by side (in consecutive columns) with every record in the dataset, as shown below:

Expected output

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 c21 c22 c23 c24 c25 c26
.  .  .  .  .  .  .  .  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .
.  .  .  .  .  .  .  .  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .
.  .  .  .  .  .  .  .  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .
.  .  .  .  .  .  .  .  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .

When I run the command join file{,} -j99, the two records of each pair are printed one underneath the other instead of side by side. If I run the same command on a file with fewer than 10 records, the output is as expected. I tried changing the value of -j to 99999 and 9999999, but the output did not change.

The output I get is:

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13
.  .  .  .  .  .  .  .  .   .   .   .   .
.  .  .  .  .  .  .  .  .   .   .   .   .

So, with 1994 records I should get 1994 * 1994 = 3,976,036 rows, but I get twice as many because the two records of each pair are printed one underneath the other.

Murlidhar Fichadia

1 Answer

A cross join pairs every row with every row. So tell awk to print the whole file next to each line, something like:

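#!/usr/bin/awk -f
# For each input line, run a second awk over the same file that prints
# that line, a tab, and then each line of the file.
# Caveat: this breaks if a line contains a single quote or a backslash.
{
    cmd = "awk -v LINE='" $0 "' " "'{ printf(\"%s\\t%s\\n\", LINE, $0) }' " \
    FILENAME
    system(cmd)
}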

But I would never do this. It's inefficient, since it starts a new awk process for each of the N input lines, and it doesn't get you much. I'd import the file into SQLite and use a cross join, which would give me a WHERE clause and named columns.
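
If the aim is only to avoid starting one awk process per line, a single awk process can buffer the file and emit the pairs itself. The following is a minimal sketch of that idea, not part of the original answer; the function name cross_join is made up for illustration, and it assumes the whole file fits in memory (which 1994 short records easily do).

```shell
#!/bin/sh
# cross_join FILE
# Print every line of FILE next to every line of FILE, tab-separated.
# (Hypothetical helper name; assumes FILE fits in memory.)
cross_join() {
    awk '
        { line[NR] = $0 }                      # buffer each record
        END {
            for (i = 1; i <= NR; i++)          # left-hand record
                for (j = 1; j <= NR; j++)      # right-hand record
                    printf "%s\t%s\n", line[i], line[j]
        }
    ' "$1"
}
```

Called as cross_join file > pairs.tsv, this writes N * N lines for an N-line input, so 3,976,036 lines for 1994 records.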

James K. Lowden
  • Actually, I am doing this to find the distance between pairs of records: take the first record, compare it with every other record, compute the Euclidean distance, then find the record closest to it and check a class field, say $6. If both records have the same class, I add 1 to the accuracy. I am trying to calculate 1-NN accuracy using AWK, but I am having a hard time figuring out the best way to do it. Could you check this link and advise me on how to go about it: http://stackoverflow.com/questions/37897154/one-nearest-neighbour-using-awk – Murlidhar Fichadia Jun 20 '16 at 09:20
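
The leave-one-out 1-NN accuracy described in the comment above does not need the cross product written out at all. The sketch below is one possible way to do it in a single awk process, under assumptions that are not confirmed by the thread: the function name one_nn_accuracy is hypothetical, the class label is in the column given as the second argument (default 6, following the "$6" in the comment), and every other column is treated as a numeric feature. If the first column is only a record ID, it would need to be dropped first.

```shell
#!/bin/sh
# one_nn_accuracy FILE [CLASS_COLUMN]
# Leave-one-out 1-nearest-neighbour accuracy.
# Assumptions (hypothetical): the class label is in CLASS_COLUMN (default 6)
# and all remaining columns are numeric features.
one_nn_accuracy() {
    awk -v cls="${2:-6}" '
        {
            n = NR; nf = NF
            for (k = 1; k <= NF; k++) v[NR, k] = $k   # buffer all fields
        }
        END {
            correct = 0
            for (i = 1; i <= n; i++) {
                best = -1; bestj = 0
                for (j = 1; j <= n; j++) {
                    if (j == i) continue              # skip the record itself
                    d = 0
                    for (k = 1; k <= nf; k++) {
                        if (k == cls) continue        # the label is not a feature
                        diff = v[i, k] - v[j, k]
                        d += diff * diff              # squared Euclidean distance
                    }
                    if (best < 0 || d < best) { best = d; bestj = j }
                }
                # count a hit when the nearest neighbour has the same class
                if (bestj && v[i, cls] == v[bestj, cls]) correct++
            }
            if (n > 0) printf "%d/%d = %.4f\n", correct, n, correct / n
        }
    ' "$1"
}
```

The squared distance is compared directly because taking the square root does not change which neighbour is nearest. Ties go to the earlier record in the file.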