How to refine a python script for a bioinformatics query

Question

I am quite new to python and I would be grateful for some assistance if possible. I am comparing the genomes of two closely related organisms [E_C & E_F] and trying to identify some basic insertions and deletions. I have run a FASTA pairwise alignment (glsearch36) using sequences from both organisms.

The below is a section of my python script where I have been able to identify a 7 nucleotide (heptamer) in one sequence (database) that corresponds to a gap in the other sequence (query). This is an example of what I have:

ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9
GAGGAAG

Assume the gap is at position 9. I am trying to refine the script to select gaps that are 20 nucleotides or more apart on both sequences and only if the surrounding nucleotides also match

ATGCACAAGTAAGGTTACCG-ACCTGTATGTGAACTCAACA
                 ||| |||
GTGCTCGGGTCACCTTACCGGACCGCCCAGGGCGGCCCAAG
21
CCGGACC

This is the section of my script, the top half deals with opening different files. it also prints a dictionary with the count of each sequence at the end.

list_of_positions = []

for match in re.finditer(r'(?=(%s))' % re.escape("-"), dict_seqs[E_C]): 
    list_of_positions.append(match.start())

set_of_positions = set(list_of_positions)
for position in list_of_positions:
    list_no_indels = []
    for number in range(position-20, position) :
        list_no_indels.append(number)
    for number in range(position+1, position+21) :
        list_no_indels.append(number)
    set_no_indels = set(list_no_indels)
    if len(set_no_indels.intersection(set_of_positions))> 0 : continue

    if len(set_no_indels.intersection(set_of_positions_EF))> 0 : continue


    print position 
    #print match.start()

    print dict_seqs[E_F][position -3:position +3]

    key = dict_seqs[E_F][position -3: position +3]

    if nt_dict.has_key(key):
        nt_dict[key] += 1 
    else:
        nt_dict[key] = 1


print nt_dict

Essentially, I was trying to edit the results of pairwise alignments to try and identify the nucleotides opposite the gaps in both the query and database sequences in order to conduct some basic Insertion/Deletion analysis.

I was able to solve one of my earlier issues by increasing the distance between gaps "-" to 20 nt's in an attempt to reduce noise, this has improved my results. Script edited above.

This is an example of my results and at the end I have a dictionary which counts the occurences of each sequence.

ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9 (position on the sequence)
GAGGAA (hexamer)


ATGCACAAGACCTGTATG # query
ATGCAGAG-AAGAGCAAG # database
9 (position)
CAAGAC (hexamer)

However, I am still trying to fix the script where I get the nucleotides around the gap to match exactly such as this, where the | is just to show matching nt's on each sequence:

GGTTACCG-ACCTGTATGTGAACTCAACA # query
     ||| ||
CCTTACCGGACCGCCCAGGGCGGCCCAAG # database

9
ACCGAC

Any help with this would be gratefully appreciated!

Is this +2 for bioinformatics, or at least 2 people understood your question? It seems I can't. Can you please clarify your question, i.e. what answer you are searching? Some test cases? — alko, Oct 31 '13 at 21:55
@alko I have provided some more details. Any suggestions would be great. Thanks! — sheaph, Nov 05 '13 at 18:01
It might be helpful to explain your terminology a little more. For instance, what do you mean by "query" and "database"? Is your "query" actually being sent into a database? And in your very first example, you have `ATGCACAA-ACCTGTATG` (query), `ATGCAGAGGAAGAGCAAG` (database), and then pull out`GAGGAAG` (hexamer), which appears in your database string, but not in your query string. And if you pull that pattern out of the database, you're left with `ATGCA-AGCAAG`, which again doesn't seem to relate to your query. Could you clarify the relationship between database, query and the haxamer subset? — jeremiahbuddha, Nov 05 '13 at 18:26
Putting away the fact your code is rather un-pythonic and trying undestand what your code do, I have **the question** for you: **do you know exactly what your code do?** If yes, please add comments to appropriate parts. Espessially, for lines `for match in re.finditer(r'(?=...`, `if len(set_no_indels.intersection(set_of_positions))> 0`, `key = dict_seqs[E_F][position -3: position +3]`. — alko, Nov 05 '13 at 19:29
You are reinventing the wheel. Why are you not using existing indel detection software? I also think you may be confusing terms, indels/SNVs are obtained from comparing to a reference species genome. It seems like you are looking more-so for software that goes along the lines of phylogeny/conservation. I'd recommend reading the work done comparing human to neanderthal genomes to see what approaches/programs were adopted there. — Chrismit, Nov 06 '13 at 17:57

score 1 · Accepted Answer · answered Nov 06 '13 at 15:43

I think I understand what you are trying to do but as @alko has said - comments in your code will definitely help a lot.

As to finding an exact match around the gap you could run a string comparison:

Something along the lines of:

if query[position -3: position] == database[position -3: position] and query[position +1: position +3] == database[position +1: position +3]:
   # Do something

You will need to replace "query" and "database" with what you have called your strings that you want to compare.

How to refine a python script for a bioinformatics query

1 Answers1