I am facing the following problem and have not found a solution yet:
I am working on a tool for sequence analysis which uses a file with reference sequences and tries to find one of these reference sequences in a test sequence.
The problem is that the test sequence might contain gaps (for example: ATG---TCA).
I want my tool to find a specific reference sequence as a substring of the test sequence even if the reference sequence is interrupted by gaps (-) in the test sequence.
For example:
one of my reference sequences:
a = TGTAACGAACGG
my test sequence:
b = ACCT**TGT--CGAA-GG**AGT
(the corresponding part from the reference sequence is given in bold)
I thought about regular expressions and tried to work my way into them, but if I am not mistaken, regular expressions only work the other way round: the wildcards have to be in the pattern, not in the string being searched. So I would need to insert the gap positions as wildcards into the reference sequence and then match it against the test sequence.
However, I do not know the positions, the length and the number of gaps in the test sequence.
My idea was to replace the gap positions (so every -) in the test sequence string with some kind of regular expression or with a special character that stands for any character in the reference sequence. Then I would compare the unmodified reference sequences against my modified test sequence...
Unfortunately, I have not found a string search function in Python or a type of regular expression that could do this.
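To make the idea concrete, here is a minimal hand-written sketch of the behaviour I am after. It assumes that each gap character in the test sequence stands for exactly one character of the reference sequence (as in my example above), and it slides a window of the reference length along the test sequence. The function name `find_with_gaps` and its `gap` parameter are just names I made up for illustration, not an existing Python or library function.

```python
def find_with_gaps(reference, test, gap="-"):
    """Return all start indices in `test` where `reference` matches,
    treating every `gap` character in `test` as a one-character wildcard.

    Assumption: one gap character replaces exactly one reference character.
    """
    n = len(reference)
    hits = []
    # Slide a window of the reference length over the test sequence.
    for start in range(len(test) - n + 1):
        window = test[start:start + n]
        # A position matches if the test character is a gap or equals
        # the reference character at the same offset.
        if all(t == gap or t == r for t, r in zip(window, reference)):
            hits.append(start)
    return hits


a = "TGTAACGAACGG"
b = "ACCTTGT--CGAA-GGAGT"  # my test sequence without the bold markers
print(find_with_gaps(a, b))  # → [4]
```

One caveat with this sketch: a window that consists mostly or entirely of gaps would match any reference sequence, so some minimum number of non-gap positions would probably be needed in practice.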
Thank you very much!