Find nucleotide subsequence within fasta sequences

Question

I have to write a function with input a FASTA file containing dna sequences with ambiguous symbols (IUPAC). Given the name of the FASTA file and an unambiguous DNA string, I want to write out the identifiers of the sequences ('>' headers) of which the given sequence could be a subsequence. I would like to implement this without generating all possible sequences and the subsequence can have ambiguous symbols as well as the sequences in the FASTA file. Example: the sequence “ACC” could be a subsequence of “CGMBHTW”. Can someone help me with it?

Can you provide *any* code for us to get started with? As for me, I have no idea about the biology behind this question, so any more clarity as to the problem you are trying to solve would also help.. — chickity china chinese chicken, Nov 30 '16 at 00:49
Can you provide some test examples showing the input and the correct output for each test? — Steve, Nov 30 '16 at 13:51
Possible duplicate of [Can Biopython perform Seq.find() accounting for ambiguity codes](http://stackoverflow.com/questions/32192933/can-biopython-perform-seq-find-accounting-for-ambiguity-codes) — Vince, Nov 30 '16 at 14:21

score 0 · Answer 1 · answered Dec 07 '16 at 11:05

You could try defining a "generalized" nucleotide as a set of letters that it represents, then convert your sequences in lists of such sets and scan one sequence for the presence of a compatible position with the other.

Here is something that may not be the most efficient code, but seems to work (double-check I got the looping indices right, though...).

A = {"A"}
C = {"C"}
G = {"G"}
T = {"T"}
R = A | G
Y = C | T
S = G | C
W = A | T
K = G | T
M = A | C
B = C | G | T
D = A | G | T
H = A | C | T
V = A | C | G
N = {"A", "C", "G", "T"}

letter2nucl = {
    "A" : A,
    "C" : C,
    "G" : G,
    "T" : T,
    "R" : R,
    "Y" : Y,
    "S" : S,
    "W" : W,
    "K" : K,
    "M" : M,
    "B" : B,
    "D" : D,
    "H" : H,
    "V" : V,
    "N" : N}

def is_subseq(seq1, seq2):
    l1 = len(seq1)
    l2 = len(seq2)
    nucls1 = [letter2nucl[letter] for letter in seq1]
    nucls2 = [letter2nucl[letter] for letter in seq2]
    i = 0
    while i < 1 + l2 - l1:
        subseq = True
        for j, nucl in enumerate(nucls1):
            if not (nucls2[i+j] & nucl):
                # empty set intersection
                subseq = False
                break
        if subseq:
            return True
        i += 1
    return False

seq1 = "ACC"
seq2 = "CGMBHTW"

if is_subseq(seq1, seq2):
    print("%s is subsequence of %s" % (seq1, seq2))

seq1 = "GRT"
seq2 = "AATCBAT"

if is_subseq(seq1, seq2):
    print("%s is subsequence of %s" % (seq1, seq2))

Results are:

ACC is subsequence of CGMBHTW
GRT is subsequence of AATCBAT

Then, you can use this on sequences read using Biopython's SeqIO functions.

Find nucleotide subsequence within fasta sequences

1 Answers1