Python/Biopython: How to search for a reference sequence (string) in a sequence with gaps?

Question

I am facing the following problem and have not found a solution yet:

I am working on a tool for sequence analysis which uses a file with reference sequences and tries to find one of these reference sequences in a test sequence.

The problem is that the test sequence might contain gaps (for example: ATG---TCA). I want my tool to find a specific reference sequence as substring of the test sequence even if the reference sequence is interrupted by gaps (-) in the test sequence.

For example:

one of my reference sequences: a = TGTAACGAACGG

my test sequence: b = ACCT**TGT--CGAA-GG**AGT

(the corresponding part from the reference sequence is given in bold)

I thought about regular expressions and tried to work my way into them, but if I am not wrong, regular expressions only work the other way round. So I would need to include the gap positions as regular expressions in the reference sequence and then match it against the test sequence. However, I do not know the positions, the lengths, or the number of gaps in the test sequence. My idea was to replace the gap positions (so every -) in the test sequence string with some kind of regular expression or with a special character which stands for any character in the reference sequence. Then I would compare the unmodified reference sequences against my modified test sequence... Unfortunately, I have not found a Python function for string search or a type of regular expression which could do this.

Thank you very much!

@Jan, this is the point of the problem: 'b' has gaps represented by '-' characters, and the problem is to align it with 'a' as closely as possible, taking the gaps into account. In the sample above the gaps in 'b' map to AA and C in 'a', so 'b' does cover 'a'. — Steve, Aug 15 '16 at 13:37

score 0 · Answer 1 · answered Aug 15 '16 at 13:23


You could do this:

import re

a = 'TGTAACGAACGG'
b = 'ACCTTGT--CGAA-GGAGT'

temp_b = re.sub(r'[\W_]+', '', b)  # removes everything that isn't a number or letter

if a in temp_b:
    pass  # do something

answered Aug 15 '16 at 13:23

dheiberg


If your regex only removes everything that isn't a number or letter, string b would look like: ACCTTGTCGAAGGAGT. In this case string a would no longer be a substring of string b, as "AA" and "C" would be missing in the positions where the "-"s have been before. Is this right? – Sefu Aug 15 '16 at 14:04
Oh, now I know what you want. I'll write the code when I'm home. It's simple enough, but maybe not entirely in regex – dheiberg Aug 15 '16 at 14:12
Would be great to see your code. Thank you very much for your help! – Sefu Aug 16 '16 at 06:39

Steve · Accepted Answer · 2016-08-15T14:12:21.333


There's good news and there's bad news...

Bad news first: What you are trying to do is not easy, and regex is really not the way to do it. In a simple case regex could be made to work (maybe), but it would be inefficient and would not scale.

However, the good news is that this is a well-understood problem in bioinformatics (e.g. see https://en.wikipedia.org/wiki/Sequence_alignment). Even better news is that there are tools in Biopython that can help you, e.g. http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

EDIT: From the discussion below it seems you are saying that 'b' is likely to be very long, but assuming 'a' is still short (12 bases in your example above), I think you can tackle this by iterating over every 12-mer in 'b', i.e. slide a 12-base window along 'b' one position at a time (obviously you'll end up with a lot of windows!). You can then easily compare each window with 'a'. If you really want to use regex (and I still advise you not to), then you can replace each '-' with a '.' and do a simple match. E.g.

import re

# a is the reference
a = 'TGTAACGAACGG'

# b is a 12-mer taken from the sequence of interest; in reality you'll be
# doing this test for every possible 12-mer in the sequence
b = 'TGT--CGAA-GG'

# turn every gap into '.', which matches any single character
b = b.replace('-', '.')
r = re.compile(b)
m = r.match(a)

print(m)
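
The 12-mer scan described above could be sketched roughly as follows. This is only a minimal sketch, assuming that each '-' in the test sequence stands for exactly one missing base (as in the example from the question). The function name find_with_gaps is hypothetical, and it uses a plain character-by-character comparison instead of a regex.

```python
def find_with_gaps(reference, test, gap="-"):
    """Return every start index in `test` where `reference` fits,
    treating each gap character in `test` as a wildcard."""
    n = len(reference)
    hits = []
    # Slide a window of len(reference) over the test sequence.
    for start in range(len(test) - n + 1):
        window = test[start:start + n]
        # A position agrees if the test has a gap there or the bases are equal.
        if all(t == gap or t == r for t, r in zip(window, reference)):
            hits.append(start)
    return hits


a = "TGTAACGAACGG"
b = "ACCTTGT--CGAA-GGAGT"
print(find_with_gaps(a, b))  # [4]
```

The scan costs roughly len(test) * len(reference) comparisons in the worst case, although most windows are rejected after the first few characters.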

edited Aug 15 '16 at 14:12

answered Aug 15 '16 at 13:28

Steve


Even better yet would be an answer with some actual code :) – Jan Aug 15 '16 at 13:30
Thank you very much for your answer. I know about alignments but that is something I do not want to use for this problem for specific reasons. – Sefu Aug 15 '16 at 13:46
Is there any way to introduce a kind of "placeholder" character instead of the "-" which could be understood as any character in a direct sequence comparison? What I am searching for should do the following: 1) Take the reference sequence and compare it character by character to the test sequence. If there is one "-" or several "----" in the test sequence, the comparison should still be true – Sefu Aug 15 '16 at 13:53
I'm curious as to why you don't want to use alignment algorithms here, this looks like a classical alignment problem? I assume I'm missing something :-) I'll add some more to my answer that might help. – Steve Aug 15 '16 at 13:55
Another option would be that the function splits the test sequence into several subsequences which are divided by the gaps. The function should remember the lengths of the gaps and then split the reference sequence into corresponding subsequences (removing the gap regions), which are then all mapped against the subsequences of the test sequence. – Sefu Aug 15 '16 at 13:57
The tool will be used to analyze high numbers of sequences. The sequences are much longer than my example sequence (the actual sequences are constructed from several Sanger reads). Calculating alignments would be very time-consuming (I already tested it). Furthermore, alignments are always error-prone and would often require manual evaluation. I am looking for a fully automated approach here. The sequences I would like to compare will only have between 0 and 2 gaps (of varying length), and the rest of the test sequence will map 100% to one of the reference sequences. – Sefu Aug 15 '16 at 14:03
Thank you very much for the edited code. My string a will be around 1600-1800 characters and string b will be longer (>1800). String b will have at most two subareas consisting of "-"s (each continuous, with a length from 1 to about 80). I could trim my string b to have the same starting and ending sequences as string a, use your regex example and then directly compare b and a. Initially I wanted to avoid the trimming of b and see if string a is in string b (ignoring the "-"s for the comparison). But it seems it is only possible the other way round. Thank you very much for your help. – Sefu Aug 16 '16 at 06:45
By the way, would you still recommend performing an alignment first? – Sefu Aug 16 '16 at 06:48
Hmm, that information is interesting: you have >1600 bases and only two sections of missing bases ('-' characters). That changes it a lot. I would actually break up your input into 3 sections (the sub-areas around the "-" sections) and then look for those complete reads in your reference sequence. When you get a "hit", you can then check the other subsections. Would something like that work? – Steve Aug 16 '16 at 10:09
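
The segment idea from the last comment might look something like the sketch below. It is only an illustration under stated assumptions: each '-' stands for exactly one missing base, and at least one gap-free segment of the test sequence lies completely inside the reference (so with a single gap and overhanging ends it finds no anchor and returns -1). The name locate_by_segments is hypothetical.

```python
import re


def locate_by_segments(reference, test, gap="-"):
    """Return the index in `test` where `reference` starts, or -1 if no
    gap-free segment of `test` can be used as an anchor."""
    # Gap-free stretches of the test sequence, longest first.
    segments = sorted(
        re.finditer(r"[^-]+", test),
        key=lambda m: len(m.group()),
        reverse=True,
    )
    n = len(reference)
    for seg in segments:
        # Look for this segment as an exact substring of the reference.
        pos = reference.find(seg.group())
        while pos != -1:
            start = seg.start() - pos  # implied start of the reference in test
            if start >= 0 and start + n <= len(test):
                window = test[start:start + n]
                # Check the remaining positions, letting gaps match anything.
                if all(t == gap or t == r for t, r in zip(window, reference)):
                    return start
            pos = reference.find(seg.group(), pos + 1)
    return -1


a = "TGTAACGAACGG"
b = "ACCTTGT--CGAA-GGAGT"
print(locate_by_segments(a, b))  # 4
```

Here the middle segment CGAA is the anchor: it is found at index 5 of the reference and sits at index 9 of the test sequence, so the reference is placed at index 4 and the other positions are then verified. When no segment qualifies as an anchor, a full window scan over the test sequence is a possible fallback.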
