R, awk, sed: merge bins and output central overlaps and then central + neighboring overlaps

Question

I have two files containing ranges. I want to overlap them and retrieve results based on exact and partial matches. An example will make it clear.

fileA:

chr1    200     400     E1
chr1    400     600     E2
chr1    600     800     E3
chr2    200     300     E4

fileB:

chr1    100     250   TF1   G1
chr1    250     650   TF2   G2
chr1    450     850   TF3   G3

Output:

chr1    100 250 TF1 G1  chr1    200     400     E1
chr1    250 650 TF2 G2  chr1    200     400     E1
chr1    250 650 TF2 G2  chr1    400     600     E2
chr1    250 650 TF2 G2  chr1    600     800     E3
chr1    450 850 TF3 G3  chr1    400     600     E2
chr1    450 850 TF3 G3  chr1    600     800     E3

I can get this far, but I need your help with the next step.

First, I want to subset the lines as follows:

if there is only 1 match (e.g. row 1 of the output file), keep it irrespective of the overlap size
if there are two matches (e.g. rows 5 and 6 of the output), keep the 'central row', i.e. the one with the most overlap (row 6 here, with an overlap of 200 compared to 150 for row 5)
if there are 3 or more matches (e.g. row 3 of the output is a complete overlap, while rows 2 and 4 are neighboring rows with partial overlaps of 150 and 50, respectively), keep only the central row, which is row 3 in this case.

Later, I want to retrieve the first neighbors, then the second neighbors, and so on, because in the actual datasets one bin in file B may overlap up to 5 or 7 bins in file A.

So, basically, I want to first get all the central overlaps, then central + 1st neighbors, then central + 2nd neighbors, and so on.

Following this rationale, my first results will be:

Result1 (Central overlaps):

chr1    100 250 TF1 G1  chr1    200     400     E1
chr1    250 650 TF2 G2  chr1    400     600     E2
chr1    450 850 TF3 G3  chr1    600     800     E3

Result2 (Central + 1st neighbor):

chr1    100 250 TF1 G1  chr1    200     400     E1
chr1    250 650 TF2 G2  chr1    200     400     E1
chr1    250 650 TF2 G2  chr1    400     600     E2
chr1    250 650 TF2 G2  chr1    600     800     E3
chr1    450 850 TF3 G3  chr1    400     600     E2
chr1    450 850 TF3 G3  chr1    600     800     E3

If possible, I would like to separately retrieve only the neighboring rows but not the central ones.

Any help will be much appreciated. Thank you.

karakfa · Accepted Answer · 2017-05-12T18:59:07.237

This is not a full solution, since I couldn't work through the additional requirements in the time I had, but perhaps it will get you started.

Assuming the files are sorted by the first key...

join fileB fileA | 
awk '{diff=($3<$7?$3:$7)-($2>$6?$2:$6)} diff>0{print $0, diff}' | 
sort -k1,1 -k9nr | 
awk '!a[$1,$2,$3]++'

chr1 250 650 TF2 G2 400 600 E2 200
chr1 450 850 TF3 G3 600 800 E3 200
chr1 100 250 TF1 G1 200 400 E1 50

The last column shows the overlap amount, which may be useful for the next steps as well.
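
The overlap amount is the smaller of the two end coordinates minus the larger of the two start coordinates, which is what the first awk in the pipeline computes from fields 2, 3, 6 and 7. As a minimal sketch of that arithmetic on its own, assuming the four coordinates are passed as "startB endB startA endA" (the function name `overlap` is made up for this illustration):

```shell
# overlap: read "startB endB startA endA" rows and print the overlap length.
# Length = min(ends) - max(starts); a value <= 0 means the ranges do not overlap.
overlap() {
  awk '{ lo = ($1 > $3 ? $1 : $3)   # larger of the two starts
         hi = ($2 < $4 ? $2 : $4)   # smaller of the two ends
         print hi - lo }'
}

# TF3 (450-850) against E2 (400-600) and E3 (600-800) from the question
printf '450 850 400 600\n450 850 600 800\n' | overlap
```

This prints 150 and 200, the same numbers quoted in the question for rows 5 and 6. A non-positive result is why the pipeline filters with `diff>0`.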

UPDATE

With a slight modification of the last awk you can get the second- and third-ranked rows (that is, the first and second neighbors) as well:

$ join fileB fileA | ...| awk '!(a[$1,$2,$3]++-1)'
chr1 250 650 TF2 G2 200 400 E1 150
chr1 450 850 TF3 G3 400 600 E2 150

$ join fileB fileA | ... | awk '!(a[$1,$2,$3]++-2)'
chr1 250 650 TF2 G2 600 800 E3 50

In your output you have chr1 250 650 listed three times; perhaps it's a typo, or I completely misunderstood what you're trying to do here...

Alternatively, you can mark the order on the records and do filtering based on that.

$ join fileB fileA | ... | awk '{print a[$1,$2,$3]++, $0}' | sort -k1n

0 chr1 100 250 TF1 G1 200 400 E1 50
0 chr1 250 650 TF2 G2 400 600 E2 200
0 chr1 450 850 TF3 G3 600 800 E3 200
1 chr1 250 650 TF2 G2 200 400 E1 150
1 chr1 450 850 TF3 G3 400 600 E2 150
2 chr1 250 650 TF2 G2 600 800 E3 50

Here the first column indicates the neighbor number, where 0 is the central row.
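
With the rank in the first column, the selections asked for in the question (central + 1st neighbors, or neighbors without the central rows) reduce to a numeric filter on that column. A minimal sketch, assuming the ranked table above is held in a shell variable; the names `ranked`, `up_to` and `neighbors_only` are made up for this illustration:

```shell
# Ranked rows copied from the table above: rank, fileB fields, fileA fields, overlap.
ranked='0 chr1 100 250 TF1 G1 200 400 E1 50
0 chr1 250 650 TF2 G2 400 600 E2 200
0 chr1 450 850 TF3 G3 600 800 E3 200
1 chr1 250 650 TF2 G2 200 400 E1 150
1 chr1 450 850 TF3 G3 400 600 E2 150
2 chr1 250 650 TF2 G2 600 800 E3 50'

# up_to N: keep the central rows plus neighbors 1..N.
up_to() { awk -v n="$1" '$1 <= n'; }

# neighbors_only: keep every row except the central ones.
neighbors_only() { awk '$1 >= 1'; }

printf '%s\n' "$ranked" | up_to 1          # central + 1st neighbors
printf '%s\n' "$ranked" | neighbors_only   # neighbors only, no central rows
```

In practice the filter would sit at the end of the pipeline instead of reading from a variable.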

Putting it all together, you can extract the desired fields into separate files:

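join fileB fileA                            |   # pair fileB and fileA rows that share a chromosome
awk '    {diff=($3<$7?$3:$7)-($2>$6?$2:$6)}     # overlap = min(ends) - max(starts)
  diff>0 {print $0,diff}'                   |   # keep real overlaps, append the overlap size
sort -k1,1 -k9nr                            |   # largest overlap first
awk '{print a[$1,$2,$3]++, $0}'             |   # prefix each row with its rank within its fileB range
sort -k1n                                   |   # group rows by rank
awk '{file=($1==0)?"central":"neighbor"$1;      # rank 0 goes to central, rank k to neighbor<k>
      print $2,$3,$4,$5,$6,$7,$8,$9 > file}'    # drop the rank and overlap columns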

This creates the following files.

==> central <==
chr1 100 250 TF1 G1 200 400 E1
chr1 250 650 TF2 G2 400 600 E2
chr1 450 850 TF3 G3 600 800 E3

==> neighbor1 <==
chr1 250 650 TF2 G2 200 400 E1
chr1 450 850 TF3 G3 400 600 E2

==> neighbor2 <==
chr1 250 650 TF2 G2 600 800 E3

Note that all of this can be combined into one awk script, but I think it's easier to understand (and update if needed) in this form.
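
For comparison, here is a rough sketch of what a single awk script might look like. It is one possible arrangement, not a tested replacement for the pipeline: it assumes fileA is small enough to hold in memory, reads fileA first and fileB second, and uses a made-up function name `rank_overlaps`. Unlike the pipeline, it needs no sorted input, and it prints rows grouped by fileB record instead of sorted by rank.

```shell
# rank_overlaps FILEA FILEB
# Prints: rank, fileB fields, fileA start/end/name, overlap (rank 0 = central).
rank_overlaps() {
  awk '
    # First file (fileA): remember every bin.
    NR == FNR { chr[NR] = $1; s[NR] = $2 + 0; e[NR] = $3 + 0; name[NR] = $4; n = NR; next }

    # Second file (fileB): collect the overlapping bins, largest overlap first.
    {
      m = 0
      for (i = 1; i <= n; i++) {
        if (chr[i] != $1) continue
        d = ($3 < e[i] ? $3 : e[i]) - ($2 > s[i] ? $2 : s[i])
        if (d <= 0) continue
        # Insertion sort: shift smaller overlaps down to make room for this hit.
        for (j = m; j >= 1 && ov[j] < d; j--) { ov[j+1] = ov[j]; id[j+1] = id[j] }
        ov[j+1] = d; id[j+1] = i; m++
      }
      for (j = 1; j <= m; j++)
        print j - 1, $1, $2, $3, $4, $5, s[id[j]], e[id[j]], name[id[j]], ov[j]
    }
  ' "$1" "$2"
}
```

It would be called as `rank_overlaps fileA fileB`, and the output can be split into files with the same final awk as in the pipeline. The nested loop compares every fileB row with every fileA bin, so for large inputs the join-based pipeline or a dedicated interval tool is likely the better choice.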

Thank you very much @karakfa for taking out the time to help me out. — Newbie, May 13 '17 at 12:26
