3

I really need some advice on what data structure and functions to use to solve a task I'm trying to perform. I'm just not sure of the best approach here.

The problem/task: I have a list of chromosomal start and end positions. I'm trying to figure the best way to push this data into a list of tuples(?) or something similar then bisect these coordinates given a start_end range value.. I have used bisect before, but only for lists containing a single value entries so just not sure what the best way is to approach multi-value comparisons.

For example, if I have the genes below,

gene_name start_pos end_pos
gene_A 100 200
gene_B 300 400
gene_C 500 600
gene_D 700 800
gene_E 900 1000

and I want to query this list with a start and end position that don't match the normal start and end to return the matching gene;

query_start = 550 query_end = 580 > should return gene_C 
query_start = 110 query end = 180 > should return gene_A

I have tried to plough my way through and have made some ridiculously ugly complicated code but I know there must be a simple/logical way to do this and I'm struggling to ask the right questions documentation/forum-searching wise.

Any helpful advice would be greatly appreciated.

Thanks

David Greydanus
  • 2,551
  • 1
  • 23
  • 42
user1995839
  • 737
  • 2
  • 8
  • 19

2 Answers2

1

Placing these values into a dictionary is the natural thing to do. Here I have used the gene names as the dictionary keys, and their corresponding ranges as the values.

genes={'gene_A': [100,200],
    'gene_B': [300, 400],
    'gene_C': [500, 600],
    'gene_D': [700, 800],
    'gene_E': [900, 1000]}

#Takes as argument a dictionary of genes to check and a range in the form of a tuple
def gene_query(gene_data,gene_range):
    for gene in gene_data:
        if gene_range[0]>=gene_data[gene][0]:
            if gene_range[1]<=gene_data[gene][1]:
                return gene
    else:
        return "No genes match your query range"


print gene_query(genes, (550, 580))
print gene_query(genes, (110, 180))

Here I have made a python function to return the name of the first gene to match the query range, but you could easily modify it to collect all the result that matched a query by appending the matching results to a list rather than immediately returning them.

David Greydanus
  • 2,551
  • 1
  • 23
  • 42
1

First, here are all the data in a list of tuples:

>>> txt='''\
... gene_name start_pos end_pos
... gene_A 100 200
... gene_B 300 400
... gene_C 500 600
... gene_D 700 800
... gene_E 900 1000'''
>>> 
>>> genes=[(name, int(d1), int(d2)) for name, d1, d2 in [line.split() for line in txt.splitlines()[1:]]]
>>> genes
[('gene_A', 100, 200), ('gene_B', 300, 400), ('gene_C', 500, 600), ('gene_D', 700, 800), ('gene_E', 900, 1000)]

Once you have that, for your simple example, you can use filter:

def query(genes, start, finish):
    return list(filter(lambda t: t[1]<start<t[2] and t[1]<finish<t[2], genes))

>>> query(genes, 550, 580)
[('gene_C', 500, 600)]
>>> query(genes, 110, 180)
[('gene_A', 100, 200)]

Or a list comprehension:

def query(genes, start, finish):
    return [t[0] for t in genes if t[1]<start<t[2] and t[1]<finish<t[2]]

>>> query(genes, 550, 580)
['gene_C']
>>> query(genes, 110, 180)
['gene_A']

Or you can use the bisect module (if genes is a sorted list).

First sort the list:

>>> genes.sort(key=lambda t: (t[1], t[2]))
>>> genes
[('gene_A', 100, 200), ('gene_B', 300, 400), ('gene_C', 500, 600), ('gene_D', 700, 800), ('gene_E', 900, 1000)]

Produce a list of key tuples that you can use as an index:

>>> keys=[(t[1], t[2]) for t in genes]
>>> keys
[(100, 200), (300, 400), (500, 600), (700, 800), (900, 1000)]

Now you can query genes with the key index and bisect:

>>> import bisect
>>> genes[bisect.bisect_left(keys, (550, 580))-1]
('gene_C', 500, 600)
>>> genes[bisect.bisect_left(keys, (110, 180))-1]
('gene_A', 100, 200)

For more complex examples, you might consider the SortedCollection recipe.

dawg
  • 98,345
  • 23
  • 131
  • 206