
My problem is the following: I have a BIG file with many rows of numbers sorted in ascending order (repetitions are possible)

1
1.5
3
3.5
6
6
...
1504054
1504056

I would like to print all pairs of row numbers whose values differ by less than a given threshold thr. For instance, with thr=2.01 I want

0 1
0 2
1 2
1 3
2 3
4 5
...
N-1 N

I wrote something in Python, but the file is huge and I think I need a smart way to do this in bash. Actually, the complete data structure also has a second column containing a string:

1 s0
1.5 s1
3 s2
3.5 s3
6 s4
6 s5
...
1504054 sN-1
1504056 sN

and, if easy to do, I would like to write in each row the pair of linked strings, possibly separated by "|":

s0|s1
s0|s2
s1|s2
s1|s3
s2|s3
s4|s5
...
sN-1|sN

Thanks for your help; I am not too familiar with bash.

Andreas Louv
user2382948
  • 2
    `I am not too familiar with bash` Why choose to use it for this task then? Seriously though, can you show what you have tried? – Andreas Louv Nov 03 '16 at 15:24
  • 1) Because the files of this type that I have are huge, and I need to do this operation on all of them, possibly for different values of thr. Hence I need something fast. – user2382948 Nov 03 '16 at 15:35
  • 1
    I think python should be faster than bash. – Ipor Sircer Nov 03 '16 at 15:36
  • 2) Sure. It's in Python:
        fname=sys.argv[1]
        lines = open(fname).readlines()
        all_time=[]
        for l in lines:
            elems = [float(x) for x in l.strip(" \n").split(" ")]
            all_time.append(elems[0])
        pair_times=[]
        for u in range (len(all_time)-1):
            for v in range (len(all_time)):
                if (all_time[v]-all_time[u]<300.0001): #this is thr
                    riga=[int(all_time[u]),int(all_time[v])]
                    pair_times.append(riga)
                else:
                    break
    and then I print pair_times (I consider that first column as a column of times) – user2382948 Nov 03 '16 at 15:37

1 Answer

In any language you can write a program implementing this pseudo code:

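kept_rows = []
while read line:
    row = line.split(sep)
    new_kept_rows = []
    for kr in kept_rows:
        # rows are sorted, so row[0] >= kr[0]; compare the numeric values
        if abs(float(row[0]) - float(kr[0])) <= thr:
            print "".join(kr[1:]) + "|" + "".join(row[1:])
            new_kept_rows.append(kr)
    new_kept_rows.append(row)  # the current row may pair with later rows
    kept_rows = new_kept_rows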

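This program keeps only the few rows that could still match the condition. Because the input is sorted, a row that is already farther than thr from the current row can never match a later one, so it is freed from memory. The memory footprint should therefore remain small even for big files.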

I would use awk because I'm comfortable with it, but Python would fit too (the pseudo code I give is very close to Python).
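
For what it's worth, here is a minimal Python sketch of the same sliding-window idea. It rests on a few assumptions: the file is sorted in ascending order, each row is "value string" separated by whitespace, and the comparison is strict (difference smaller than thr, as the question asks). The names close_pairs and parse are made up for this illustration.

```python
import sys
from collections import deque


def close_pairs(rows, thr):
    """Yield (earlier_label, later_label) for rows whose values differ by less than thr.

    Assumes rows is an iterable of (value, label) sorted by value, ascending.
    """
    window = deque()  # rows that may still pair with a later row
    for value, label in rows:
        # Sorted input: once the oldest kept row is too far away,
        # it can never match this row or any later one.
        while window and value - window[0][0] >= thr:
            window.popleft()
        for _, earlier in window:
            yield earlier, label
        window.append((value, label))


def parse(lines):
    """Turn 'value label' text lines into (float, str) tuples, skipping blank lines."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield float(fields[0]), fields[1]


# Hypothetical command line: python pairs.py data.txt 2.01
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fh:  # the file is streamed, never loaded whole
        for a, b in close_pairs(parse(fh), float(sys.argv[2])):
            print(f"{a}|{b}")
```

On the six sample rows from the question with thr=2.01 this yields s0|s1, s0|s2, s1|s2, s1|s3, s2|s3, s4|s5. Only the rows within thr of the current one are held in memory; note that the output itself can still be large if many rows cluster closely.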

Setop