1

I have a file with coordinates (interval) and another one with locations (points). I want to find all points that overlap with the intervals

Example Locations

# Locations
chr1 99
chr1 100
chr1 101
chr1 102
...
chr2 120
chr2 121
chr2 122
chr2 123

# coordinate
[('chr1', 98, 120, 'CSF3R'), (...), ...]

What I want is to ignore the locations that are outside my coordinates. So the result would be

chr1 99
chr1 100
chr1 101
chr1 102
...
chr1 120

This could be very simple by using a program called Bedtools (for linux) but for many reasons, I cannot use it and I am looking for alternatives.

Originally I developed this

def roi(gvcf_data,bedregions):
    '''Interset. Take ROI of the gVCF using
       the bed file
       gvcf_data is location, a dataframe with three colums Chr,
       position and something else
       bedregions is coordiantes in the format shown above

    '''
    rows = []
    for region in bedregions:
        for index, row in gvcf_data.iterrows():
            if (region[0] == row['CHROM']) & (row['POS'] in range (region[1],region[2])):
                rows.append([row['CHROM'], row['POS'],row['DP'], region[3]])
    return pd.DataFrame(rows, columns=["CHROM", "POS", "DP", "Probe_name"])

This works and returns what I want however it takes years because bedregions has 500 tuples (one per coordinates) and gvcf_data has 108552 rows.

I have been thinking of alternatives and I have found https://pypi.org/project/intervaltree/

what is ideal for this. The problem I have is the chromosomes. I don't know how to apply that filter in an efficient way and then apply intervaltree.

So far, I have done this

def roi2(gvcf_data,bedregions):

    tree = intervaltree.IntervalTree() # Initialize an empty tree

    for region in bedregions:
        # Feed the tree with my coordinates
        tree.addi(int(region[1]), int(region[2]), region[3])

        for index, row in gvcf_data.iterrows():
            if (region[0] == row['CHROM']):
                tree[row['POS']]
# This doesn´t work

Using or not using intervaltree. Is there a way to do this? I have been working in programming no more than a few months and these is at the moment very complex for me.

1 Answers1

0

IIUC you don't need to loop through your df row by row multiple times.
Loop through each element of your bedregions, get a filtered version of gvcf_data with all locations inbetween the upper and lower boundaries, add a column with the probe_name and append it to a list. At the end you concat all those dfs in the list together to one dataframe (is one big df really needed?). According to your steps after that you could also save each chunk of dataframe as value to a key (which would be the probe_name I guess) in a dictionary.

The input I used is at the end of the answer.

list_of_df_chunks = []
for _, lower_num, upper_num, probe in bedregions:
    chunk = gvcf_data.loc[(gvcf_data['POS'] >= lower_num) & (gvcf_data['POS'] <= upper_num), :].copy()
    chunk.loc[:,'Probe_name'] = probe
    
    list_of_df_chunks.append(chunk)
    
result = pd.concat(list_of_df_chunks, ignore_index=True)

Output result:

   CHROM  POS  DP Probe_name
0   chr1   99   3      CSF3R
1   chr1  100   8      CSF3R
2   chr1  101   6      CSF3R
3   chr1  102   4      CSF3R
4   chr1  120  10      CSF3R
5   chr1  100   8      WS590
6   chr1  101   6      WS590
7   chr1  102   4      WS590
8   chr1  120  10      AB345
9   chr1  121   4      AB345
10  chr1  122   5      AB345
11  chr1  120  10      LL440
12  chr1  121   4      LL440
13  chr1  122   5      LL440
14  chr1  123   0      LL440

Used Input:

gvcf_data = pd.DataFrame(
    {'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
     'POS': [99, 100, 101, 102, 120, 121, 122, 123],
     'DP': [3, 8, 6, 4, 10, 4, 5, 0]}
)

bedregions = [
    ('chr1', 98, 120, 'CSF3R'), 
    ('chr1', 100, 109, 'WS590'), 
    ('chr1', 105, 122, 'AB345'), 
    ('chr1', 120, 125, 'LL440')]
Rabinzel
  • 7,757
  • 3
  • 10
  • 30
  • I think that in CHROM column, there is more than one chromosome (e.g. chr1 and chr2) – Manolo Dominguez Becerra Jul 30 '22 at 11:12
  • then the minimal reproducible example is little to minimal :) I thought about that too, but just from the data we got, we can't know – Rabinzel Jul 30 '22 at 11:16
  • Sorry for the mistake. Manolo Dominguez Becerra is right. There many many chr number and also chrX and chrY – tere becerra Jul 30 '22 at 12:52
  • I have done so work using only python (no external libraries) but still having error. I have asked that in a second question here https://stackoverflow.com/questions/73175881/overlapping-points-in-a-interval-having-conditions – tere becerra Jul 30 '22 at 13:09
  • why you just don't add all information we need to the question? Making a new question at that point is just useless – Rabinzel Jul 30 '22 at 13:19
  • what even means "still having error"? so my solution doesn't work for you? – Rabinzel Jul 30 '22 at 13:29