I have a file with coordinates (interval) and another one with locations (points). I want to find all points that overlap with the intervals
Example Locations
# Locations
chr1 99
chr1 100
chr1 101
chr1 102
...
chr2 120
chr2 121
chr2 122
chr2 123
# coordinate
[('chr1', 98, 120, 'CSF3R'), (...), ...]
What I want is to ignore the locations that are outside my coordinates. So the result would be
chr1 99
chr1 100
chr1 101
chr1 102
...
chr1 120
This could be very simple by using a program called Bedtools (for linux) but for many reasons, I cannot use it and I am looking for alternatives.
Originally I developed this
def roi(gvcf_data,bedregions):
'''Interset. Take ROI of the gVCF using
the bed file
gvcf_data is location, a dataframe with three colums Chr,
position and something else
bedregions is coordiantes in the format shown above
'''
rows = []
for region in bedregions:
for index, row in gvcf_data.iterrows():
if (region[0] == row['CHROM']) & (row['POS'] in range (region[1],region[2])):
rows.append([row['CHROM'], row['POS'],row['DP'], region[3]])
return pd.DataFrame(rows, columns=["CHROM", "POS", "DP", "Probe_name"])
This works and returns what I want however it takes years because bedregions has 500 tuples (one per coordinates) and gvcf_data has 108552 rows.
I have been thinking of alternatives and I have found https://pypi.org/project/intervaltree/
what is ideal for this. The problem I have is the chromosomes. I don't know how to apply that filter in an efficient way and then apply intervaltree.
So far, I have done this
def roi2(gvcf_data,bedregions):
tree = intervaltree.IntervalTree() # Initialize an empty tree
for region in bedregions:
# Feed the tree with my coordinates
tree.addi(int(region[1]), int(region[2]), region[3])
for index, row in gvcf_data.iterrows():
if (region[0] == row['CHROM']):
tree[row['POS']]
# This doesn´t work
Using or not using intervaltree. Is there a way to do this? I have been working in programming no more than a few months and these is at the moment very complex for me.