I have two files of ranges that I want to overlap, and then retrieve the results based on exact and partial matches. An example will make it clear.
fileA:
chr1 200 400 E1
chr1 400 600 E2
chr1 600 800 E3
chr2 200 300 E4
fileB:
chr1 100 250 TF1 G1
chr1 250 650 TF2 G2
chr1 450 850 TF3 G3
Output:
chr1 100 250 TF1 G1 chr1 200 400 E1
chr1 250 650 TF2 G2 chr1 200 400 E1
chr1 250 650 TF2 G2 chr1 400 600 E2
chr1 250 650 TF2 G2 chr1 600 800 E3
chr1 450 850 TF3 G3 chr1 400 600 E2
chr1 450 850 TF3 G3 chr1 600 800 E3
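For reference, the table above can be reproduced with a minimal pure-Python sketch like the one below. It is not necessarily the tool to use on real data; it only illustrates the overlap logic and also records the overlap width that the next step needs. The names `parse`, `overlaps`, `file_a` and `file_b` are illustrative, and it assumes whitespace-separated fields and that intervals which merely touch do not count as overlapping.

```python
# In-memory copies of fileA and fileB from the example above.
file_a = """chr1 200 400 E1
chr1 400 600 E2
chr1 600 800 E3
chr2 200 300 E4"""
file_b = """chr1 100 250 TF1 G1
chr1 250 650 TF2 G2
chr1 450 850 TF3 G3"""


def parse(text):
    """Split each line into fields and convert start/end to int."""
    rows = []
    for line in text.strip().splitlines():
        f = line.split()
        rows.append((f[0], int(f[1]), int(f[2]), *f[3:]))
    return rows


def overlaps(b_rows, a_rows):
    """Return (b_row, a_row, overlap_width) for every overlapping pair."""
    out = []
    for b in b_rows:
        for a in a_rows:
            if a[0] != b[0]:
                continue  # different chromosome
            width = min(a[2], b[2]) - max(a[1], b[1])
            if width > 0:  # assumption: touching ends are not an overlap
                out.append((b, a, width))
    return out


pairs = overlaps(parse(file_b), parse(file_a))
for b, a, width in pairs:
    print(*b, *a, width)  # the output row followed by its overlap width
```

The nested loop is quadratic, so for large files a dedicated interval tool would be the more realistic choice; the sketch is only meant to pin down what "overlap size" means here.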
Up to this step I can manage, but the next step is where I need your help.
From this output, I first want to subset the following lines:
- lines that have only one match (e.g. row 1 of the output, irrespective of the overlap size)
- if there are two matches (e.g. rows 5 and 6 of the output), the 'central' row, i.e. the one with the larger overlap (that will be row 6, as its overlap is 200 compared to 150 for row 5)
- if there are three or more matches (e.g. rows 2 to 4 of the output, where row 3 is a complete overlap and the neighboring rows 2 and 4 are partial overlaps of 150 and 50, respectively), only the central row, which will be row 3 in this case.
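To make the three rules concrete, here is a rough sketch of how the central row could be picked per fileB entry. The overlap widths in `hits` were worked out by hand from the example, and `central_index` is a hypothetical helper. Two things are my own assumptions: for an even number of matches greater than two, the two middle rows compete on overlap width, and a tie goes to the left one.

```python
from itertools import groupby

# (fileB name, fileA name, fileA start, overlap width) for the six output rows.
hits = [
    ("TF1", "E1", 200, 50),
    ("TF2", "E1", 200, 150),
    ("TF2", "E2", 400, 200),
    ("TF2", "E3", 600, 50),
    ("TF3", "E2", 400, 150),
    ("TF3", "E3", 600, 200),
]


def central_index(group):
    """Index of the central hit in a group sorted by fileA start."""
    n = len(group)
    if n % 2 == 1:
        return n // 2  # odd count (including 1): the middle row
    # Even count: the two middle rows compete on overlap width.
    left, right = n // 2 - 1, n // 2
    return left if group[left][3] >= group[right][3] else right


central = []
# groupby needs the rows of one fileB entry to be adjacent, as they are here.
for name, grp in groupby(hits, key=lambda h: h[0]):
    group = sorted(grp, key=lambda h: h[2])
    central.append(group[central_index(group)])

for hit in central:
    print(*hit)
```

In real data the grouping key would probably have to be the whole fileB row (chromosome, start, end, names) rather than the TF name alone, since names may repeat.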
Later, I want to retrieve the first neighbors, then the second neighbors, and so on, because in the actual datasets one bin in file B may overlap with up to 5 or 7 bins in file A.
So basically, what I want is to first get all the central overlaps, then central + 1st neighbors, then central + 2nd neighbors, and so on.
Following this rationale, my first results will be:
Result1 (Central overlaps):
chr1 100 250 TF1 G1 chr1 200 400 E1
chr1 250 650 TF2 G2 chr1 400 600 E2
chr1 450 850 TF3 G3 chr1 600 800 E3
Result2 (Central + 1st neighbor):
chr1 100 250 TF1 G1 chr1 200 400 E1
chr1 250 650 TF2 G2 chr1 200 400 E1
chr1 250 650 TF2 G2 chr1 400 600 E2
chr1 250 650 TF2 G2 chr1 600 800 E3
chr1 450 850 TF3 G3 chr1 400 600 E2
chr1 450 850 TF3 G3 chr1 600 800 E3
If possible, I would also like to retrieve only the neighboring rows separately, without the central ones.
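One possible way to frame the tiers is as a rank distance from the central row: distance 0 is the central overlap, distance 1 the first neighbors, and so on. The sketch below is only that, a sketch: `groups` hard-codes the example hits together with the central hit chosen by the rules above, and `tier` is a hypothetical helper name.

```python
# fileB name -> (fileA hits ordered by start, central hit for that entry)
groups = {
    "TF1": (["E1"], "E1"),
    "TF2": (["E1", "E2", "E3"], "E2"),
    "TF3": (["E2", "E3"], "E3"),
}


def tier(max_dist, min_dist=0):
    """Hits whose rank distance from the central hit is in [min_dist, max_dist]."""
    rows = []
    for b_name, (a_names, centre) in groups.items():
        c = a_names.index(centre)
        for i, a_name in enumerate(a_names):
            if min_dist <= abs(i - c) <= max_dist:
                rows.append((b_name, a_name))
    return rows


result1 = tier(0)                      # central overlaps only
result2 = tier(1)                      # central + 1st neighbors
neighbors_only = tier(1, min_dist=1)   # 1st neighbors without the central rows

for name, rows in [("Result1", result1), ("Result2", result2),
                   ("Neighbors only", neighbors_only)]:
    print(name, rows)
```

Calling `tier(2)` would add the second neighbors, and `tier(2, min_dist=2)` would return the second neighbors on their own.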
Any help will be much appreciated. Thank you.