I have a file like this, with about 20 million rows:
11.tsv SDSS01001000.1 M1 100021
11.tsv SDSS01001000.1 M1 100082
11.tsv SDSS01001000.1 M1 100140
11.tsv SDSS01001000.1 M1 100270
11.tsv SDSS01001000.1 M1 100634
11.tsv SDSS01001000.1 M1 100849
11.tsv SDSS01001000.1 M1 100865
11.tsv SDSS01001000.1 M1 101037
11.tsv SDSS01001000.1 M1 101086
11.tsv SDSS01001000.1 M1 101164
11.tsv SDSS01001000.1 M1 101203
11.tsv SDSS01001000.1 M1 101338
11.tsv SDSS01001000.1 M1 101844
11.tsv SDSS01001000.1 M1 102117
11.tsv SDSS01001000.1 M1 102224
I need to check that, while the 2nd column stays the same, the value in the 4th column is more than 80 above the previous row's value and more than 80 below the next row's value. An example table is shown below.
[![enter image description here][1]][1]
The final expected result is the set of rows where neither the 5th nor the 6th column of the example table is less than 80.
So the required table will look like this:
11.tsv SDSS01001000.1 M1 100270
11.tsv SDSS01001000.1 M1 100634
11.tsv SDSS01001000.1 M1 101338
11.tsv SDSS01001000.1 M1 101844
11.tsv SDSS01001000.1 M1 102117
11.tsv SDSS01001000.1 M1 102224
In short, I want to keep only the lines where the value in the 4th column differs by more than 80 from both the previous and the next line. This is a biological question: the 2nd and 4th columns correspond to the chromosome and the chromosomal location. I need to keep the locations that have no other reported location within 80 bases.
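To make the rule concrete, here is a minimal sketch of the check as a single streaming pass. It is an illustration under stated assumptions, not a tested solution for the full data: `filter_isolated` is a hypothetical helper name, the input is assumed to be sorted by the 2nd column and then numerically by the 4th, and a gap of exactly 80 is treated as too close.

```shell
# filter_isolated MIN_GAP < input
# Hypothetical helper: prints the rows whose column-4 position is more than
# MIN_GAP away from both neighbouring rows on the same chromosome (column 2).
# Assumes the input is sorted by column 2, then numerically by column 4.
filter_isolated() {
  awk -v gap="${1:-80}" '
    {
      # Is the current row too close to the previous row on the same chromosome?
      near = (NR > 1 && $2 == chrom && $4 - pos <= gap)
      # The buffered previous row can now be decided: print it only if it was
      # far from its own predecessor and is also far from the current row.
      if (NR > 1 && prev_ok && !near) print line
      prev_ok = !near
      line = $0; chrom = $2; pos = $4
    }
    # The last row has no successor, so only its predecessor matters.
    END { if (NR > 0 && prev_ok) print line }
  '
}
```

It would be called as `filter_isolated 80 < input.tsv > isolated.tsv`, where both file names are placeholders. Since each row is compared only with its immediate neighbours, the file is read once and only one row is held in memory. If the file is not already sorted, something like `sort -k2,2 -k4,4n` would be needed first.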
Thanks in advance for your help.
Presently, I add and subtract 80 from each position, making it like a BED file with start and end positions, and then use this command to check whether any other location in the file falls between the start and end.
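    # $1 is split into arr[0] (file name) and arr[1] (position)
    arr=($1)
    # Count the distinct other positions within 50 of arr[1] in that file
    n=$(awk -v a="${arr[1]}" -v b="$((${arr[1]}-50))" -v c="$((${arr[1]}+50))" '$4>b && $4<c && $4!=a {print $2,$4}' "tmp_Screening/${arr[0]}" | sort -u | awk 'END {print NR}')
    echo -e "${arr[0]}\t${arr[1]}\t${n}"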
But this is time-consuming: it takes a few days to process each file. Please help.

  [1]: https://i.stack.imgur.com/u48Vx.png
  [2]: https://i.stack.imgur.com/3EPGe.png