Bash: find all pairs of lines such that the difference of their first field is less than a threshold

Question

My problem is the following: I have a BIG file with many rows containing ordered numbers (repetitions are possible).

I would like to print all pairs of row numbers whose difference is smaller than a given threshold thr. For instance, with thr=2.01, I want

I wrote something in Python, but the file is huge and I think I need a smart way to do this in bash. Actually, the complete data structure also has a second column containing a string:

1 s0
1.5 s1
3 s2
3.5 s3
6 s4
6 s5
...
1504054 sN-1
1504056 sN

and, if easy to do, I would like to write in each row the pair of linked strings, possibly separated by "|":

s0|s1
s0|s2
s1|s2
s1|s3
s2|s3
s4|s5
...
sN-1|sN

Thanks for your help; I am not too familiar with bash.

`I am not too familiar with bash` Why choose to use it for this task then? Seriously though, can you show what you have tried? — Andreas Louv, Nov 03 '16 at 15:24
1) Because the files of this type that I have are huge, and I need to do this operation on all such files, possibly for different values of thr. Hence I need something fast. — user2382948, Nov 03 '16 at 15:35
2) Sure. It's in Python:

fname=sys.argv[1]
lines = open(fname).readlines()
all_time=[]
for l in lines:
    elems = [float(x) for x in l.strip(" \n").split(" ")]
    all_time.append(elems[0])
pair_times=[]
for u in range (len(all_time)-1):
    for v in range (len(all_time)):
        if (all_time[v]-all_time[u]<300.0001): #this is thr
            riga=[int(all_time[u]),int(all_time[v])]
            pair_times.append(riga)
        else:
            break

and then I print pair_times (I consider that first column as a column of times) — user2382948, Nov 03 '16 at 15:37

score 1 · Answer 1 · answered Nov 03 '16 at 15:37

In any language you can write a program implementing this pseudocode:

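kept_rows = []                       # earlier rows that can still pair with later ones
while read line:
    row = line.split(sep)
    new_kept_rows = []
    for kr in kept_rows:
        if abs(float(row[0]) - float(kr[0])) <= thr:
            print "".join(kr[1:]) + "|" + "".join(row[1:])
            new_kept_rows.append(kr)
    new_kept_rows.append(row)        # the current row may pair with later rows
    kept_rows = new_kept_rows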

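This program only keeps the few lines that could still match the condition. Because the input is sorted, a kept row that is already too far from the current row can never match a later one, so it is freed from memory. The memory footprint should therefore remain small even for big files.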

I would use awk because I'm comfortable with it, but Python would fit too (the pseudocode I give is very close to Python).
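
As a rough illustration, the pseudocode above could be turned into a runnable Python sketch along these lines. It assumes the rows are sorted by their first field and have the form `<number> <label>` separated by whitespace, as in the question. The function name `linked_pairs` and the use of `collections.deque` are choices made for this sketch, not part of the original answer.

```python
from collections import deque


def linked_pairs(lines, thr):
    """Yield 'a|b' for rows whose first fields differ by at most thr.

    Assumption: rows are sorted by their first field, so a row that is
    too far behind the current one can never match a later row.
    """
    window = deque()  # (value, label) of earlier rows still within reach
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        value, label = float(fields[0]), fields[1]
        # Drop earlier rows that are now further away than thr.
        while window and value - window[0][0] > thr:
            window.popleft()
        # Every row left in the window pairs with the current row.
        for _, earlier_label in window:
            yield f"{earlier_label}|{label}"
        window.append((value, label))


if __name__ == "__main__":
    # Sample rows taken from the question; thr=2.01 as in the question.
    sample = ["1 s0", "1.5 s1", "3 s2", "3.5 s3", "6 s4", "6 s5"]
    for pair in linked_pairs(sample, 2.01):
        print(pair)
```

To process a real file, `sys.stdin` or an open file object can be passed in place of `sample`, since the function only iterates over lines once. Only the rows inside the current window are held in memory at any time.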
