-2

I have a file like this having about 20 million rows.

11.tsv       SDSS01001000.1 M1  100021
11.tsv       SDSS01001000.1 M1  100082
11.tsv       SDSS01001000.1 M1  100140
11.tsv       SDSS01001000.1 M1  100270
11.tsv       SDSS01001000.1 M1  100634
11.tsv       SDSS01001000.1 M1  100849
11.tsv       SDSS01001000.1 M1  100865
11.tsv       SDSS01001000.1 M1  101037
11.tsv       SDSS01001000.1 M1  101086
11.tsv       SDSS01001000.1 M1  101164
11.tsv       SDSS01001000.1 M1  101203
11.tsv       SDSS01001000.1 M1  101338
11.tsv       SDSS01001000.1 M1  101844
11.tsv       SDSS01001000.1 M1  102117
11.tsv       SDSS01001000.1 M1  102224

I need to check while second column is same, the value in 3rd column is more than 80 from previous row and less than 80 from next row. An example table is as below.

[![enter image description here][1]][1]

Final expected result are those where neither 5th or 6th column is less than 80

So the required table will look like this

11.tsv  SDSS01001000.1  M1  100270
11.tsv  SDSS01001000.1  M1  100634
11.tsv  SDSS01001000.1  M1  101338
11.tsv  SDSS01001000.1  M1  101844
11.tsv  SDSS01001000.1  M1  102117
11.tsv  SDSS01001000.1  M1  102224

In short I want to filter lines where the value in 4th column has difference of more than 80 from the previous and next line. This is a biological question and the 2nd and 4th column correspond to chromosome number and chromosomal location. I need to filter those locations that do not have any other reported location in 80 base vicinity.

Thanks in advance for your help.

Presently I added and subtracted 80 from the position making it like a bed file with start and end position and then use this command to check in the file if there there is no location within the start and end.

arr=($1);
echo -e "${arr[0]}\t${arr[1]}\t`awk -v a=${arr[1]} -v b=$((${arr[1]}-50)) -v c=$((${arr[1]}+50)) '$4>b && $4<c && $4!=a {print $2,$4}' tmp_Screening/${arr[0]}|sort -u|awk 'END {print NR}'`"

But this is time consuming. It takes a few days to process each file. please help [1]: https://i.stack.imgur.com/u48Vx.png [2]: https://i.stack.imgur.com/3EPGe.png

onkar
  • 29
  • 7
  • 6
    What have you tried to solve the problem by yourself, and what didn't work? – GMB Oct 03 '20 at 16:01
  • 1
    how does the last row (`102224`) get into the final answer if it has no `next` row? – markp-fuso Oct 03 '20 at 16:54
  • Yes, I have checked some solutions, but I am not very expert in this thus could not solve it. – onkar Oct 03 '20 at 23:02
  • I could do the comparison with previous but not with next. for previous comparison the command was like tthis awk 'NR==1 {a=$4; next} {print $4 - a; a=$4}' < <(head -n5 file11) this prints value only, but I can fix it to print lines once working fine. Even if the last row is not considered that's fine, I can check it manually. The actual problem is biological, thus the last one lies at the end of chromosome thus it do not have any other snp at 80 bases near it, thats why considered. – onkar Oct 03 '20 at 23:11
  • This is one question only. I need to find locations that don't have any other location in 80 base vicinity. I tried making the question simple to understand please let it go. I need urgent help – onkar Oct 03 '20 at 23:18
  • `myfun() { perl -ane 'if($F[3]-$pos[1] > 80 and $pos[1]-$pos[0] > 80) { print $last_line; } $last_line=$_;push(@pos,$F[3]);$#pos > 1 and shift @pos; END{if($pos[1]-$pos[0] > 80) { print $last_line; }}'; }; export -f myfun; cat bigfile | parallel --pipe --group-by 2 myfun ` – Ole Tange Oct 04 '20 at 00:19
  • This processes 20 MB/sec/CPU thread on my machine. – Ole Tange Oct 04 '20 at 00:27
  • Also, please don't post images of text; see https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors – tripleee Oct 05 '20 at 07:14
  • you should clean up your question, your write about 2nd, 3rd, 4th, 5th, 6th columns seemingly interchangeably. – Kjetil S. Oct 08 '20 at 02:26

2 Answers2

0
#!/usr/bin/perl
use warnings; use strict;
sub second_col {(split(/\s+/,shift))[1]}
sub fourth_col {(split(/\s+/,shift))[3]}
my($prev,$this,$next);
while(<>){
    ($prev,$this,$next)=($this,$next,$_);
    next if not defined $prev;
    next if second_col($prev) ne second_col($this);
    next if second_col($this) ne second_col($next);
    next if fourth_col($this) - fourth_col($prev) < 80;
    next if fourth_col($next) - fourth_col($this) < 80;
    print $this;
}

Run:

chmod +x program.pl
./program.pl datafile
Kjetil S.
  • 3,468
  • 20
  • 22
0
myfun() {
  perl -ane 'if($F[3]-$pos[1] > 80 and $pos[1]-$pos[0] > 80) { print $last_line; }
             $last_line=$_;
             push(@pos,$F[3]);
             $#pos > 1 and shift @pos;
             END{if($pos[1]-$pos[0] > 80) { print $last_line; }}';
}
export -f myfun
cat bigfile | parallel -N1 --pipe --group-by 2 myfun

This processes 20 MB/s on my laptop.

It uses GNU Parallel's --group-by which looks at column 2 and for each block where the value of column 2 is the same, run myfun on that block.

Ole Tange
  • 31,768
  • 5
  • 86
  • 104