Remove records with same cross product values in AWK

Question

When I take the cross product of the records in file1.txt with themselves, writing the result to file2.txt with this command:

join file1.txt{,} -j999 > file2.txt

This pairs each record in file1.txt with every record in file1.txt. For example, given this sample dataset:

r1
r2
r3

I get

r1 r1
r1 r2
r1 r3
r2 r1
r2 r2
r2 r3
r3 r1
r3 r2
r3 r3

I don't want the records that pair a line with itself: r1 r1, r2 r2, and so on.

If it's possible to do this while computing the cross product, how do I get the expected result? If not, how do I remove those records after running join file1.txt{,} -j 999?

I tried filtering the result with another awk command, using these conditions:

if($i!=$(i+12)){print $0;}

and

if($1!=$13){print $0;}

Because each record has a serial number (1, 2, 3, ...), file2.txt looks like this:

c1  c13 --> column 1 and column 13
1   1
1   2
1   3
1   4
2   1
2   2
2   3
2   4
3   1
3   2
3   3
3   4

I simply compare the serial numbers and print the record if they are not equal, but I get undesired results.

It skips all the records before the point where $1!=$13, so rows like these are missing:

2  1
3  1
3  2

It should only skip the records that match the pattern r1 r1, r2 r2, ...

Update

The 1st and 13th columns are serial numbers.

Please update showing a [mcve] - and with text, not an image. Otherwise it is hard to work on the solution. — fedorqui, Jun 20 '16 at 12:30

score 1 · Answer 1 · answered Jun 20 '16 at 11:39

Just loop through the file twice (note that BEGINFILE is a GNU awk extension):

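awk 'FNR==NR {a[FNR]=$0; next}         # 1st pass: store each line by its line number
     BEGINFILE{lines=NR-FNR}           # start of 2nd pass: NR-FNR is the line count of the 1st pass
     {
       for (i=1;i<=lines;i++) {
           if (i!=FNR) print $0, a[i]  # 2nd pass: pair with every stored line except itself
       }
     }' file file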

On the first pass, this stores the data in an array as a[line_number]=value_on_that_line. On the second pass, for each line it loops over all the stored line numbers and prints every pair except the one where the stored index equals the current line number, that is, where a line would be paired with itself.

For your given file with r1, r2, r3 it returns:

$ awk 'FNR==NR {a[FNR]=$0; next} BEGINFILE{lines=NR-FNR} {for (i=1;i<=lines;i++) { if (i!=FNR) print $0, a[i]}}' f f
r1 r2
r1 r3
r2 r1
r2 r3
r3 r1
r3 r2
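
If gawk is not available, the same pairs can be produced in a single pass. The following is a minimal sketch and not part of the original answer: it assumes the whole file fits in memory (as the two-pass version also requires), and the file names file1.txt and pairs.txt are placeholders chosen for illustration.

```shell
# Sample input taken from the question.
printf 'r1\nr2\nr3\n' > file1.txt

# Read the file once, keep every line in memory, then pair them up in END.
# Uses only POSIX awk features (no BEGINFILE).
awk '
    { line[NR] = $0 }                 # remember each record by its line number
    END {
        for (i = 1; i <= NR; i++)
            for (j = 1; j <= NR; j++)
                if (i != j)           # skip pairing a line with itself
                    print line[i], line[j]
    }
' file1.txt > pairs.txt

cat pairs.txt
```

Because the comparison is on line numbers rather than on content, two different lines that happen to hold the same text are still paired with each other.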

fedorqui

I am trying it, but I have 4,000,000 records and 26 cols in total, where r1 and r2 have 16 cols each, and it is taking a lot of time. Is there a more efficient way to remove them than the one you provided? For example, just compare $i == $(i+12): if, within a record, the first 13 fields match the next 13 fields, remove the line (or print nothing, so it is dropped)? – Murlidhar Fichadia Jun 20 '16 at 12:10
@MurlidharFichadia: Are the column numbers fixed for both? like col 1 and 10 ? – Inian Jun 20 '16 at 12:14
@Inian please check the image – Murlidhar Fichadia Jun 20 '16 at 12:20
Are we sure that the file contents have no repetitions? – anishsane Jun 20 '16 at 12:20
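
Regarding the follow-up about filtering the join output instead: a bare awk pattern prints each line for which it is true, so the $1!=$13 test from the question can be written as a one-line filter. This is only a hedged sketch, assuming the serial numbers are unique per record. The sample data, the file names cross.txt and filtered.txt, and the variable c are made up for illustration; the records here have two columns each, so the second serial number sits in column 3 rather than 13.

```shell
# Hypothetical cross-product output: two-column records (serial, value),
# so the second record's serial number is in column 3 here.
printf '%s\n' '1 a 1 a' '1 a 2 b' '2 b 1 a' '2 b 2 b' > cross.txt

# c is the column holding the second record's serial number
# (it would be 13 for the layout described in the question).
# Lines where both serial numbers are equal are dropped; all others are printed.
awk -v c=3 '$1 != $c' cross.txt > filtered.txt

cat filtered.txt
```

This reads the input once and keeps nothing in memory, so its cost grows with the size of the cross product itself rather than with any extra bookkeeping.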
