Fast python way to check if subsequence exists in a string and allow for 1-3 mismatches?

Question

I am wondering if there is a fast Python way to check if a subsequence exists in a string, while allowing for 1-3 mismatches.

For example the string: "ATGCTGCTGA" The subsequence "ATGCC" would be acceptable and return true. Note that there is 1 mismatch in bold.

I have tried to use the pairwise2 functions from the Bio package but it is very slow. I have 1,000 strings, and for each string I want to test 10,000 subsequences.

Computation speed would be prioritized here.

** Note I don't mean gaps, but one nucleotide (the letter A, T, C, G) being substituted for another one.

Please provide enough code so others can better understand or reproduce the problem. — Community, Aug 19 '22 at 22:18

score 1 · Answer 1 · answered Aug 19 '22 at 23:05

1

One can zip the strings and compare tuples left and right side, and count false. Here false is just 1. Should not be slow...

st = "ATGCTGCTGA"

s = "ATGCC"

[ x==y for (x,y) in zip(st,s)].count(False)

1

answered Aug 19 '22 at 23:05

LetzerWille

5,355
4
23
26

score 1 · Answer 2 · answered Aug 19 '22 at 23:20

Try:

ALLOWED_MISMATCHES = 3

s = "ATGCTGCTGA"
subsequence = "ATGCC"

for i in range(len(s) - len(subsequence) + 1):
    if sum(a != b for a, b in zip(s[i:], subsequence)) <= ALLOWED_MISMATCHES:
        print("Match")
        break
else:
    print("No Match")

Prints:

Match

JvdV · Answer 3 · 2022-08-20T14:33:43.027

1

This is possible with regex if you use the PyPi's regex module instead with fuzzy matching:

ATGCC{i<=3:[ATGC]}

ATGCC - Look for exactly 'ATGCC'
{i<=3:[ATCG]} - Allow for up to three insertions that are within the character class of nucleotide character [ATGC].

For example:

import regex as re
s = 'ATGCTGCTGA'
print(bool(re.search(r'ATGCC{i<=3:[ATGC]}', s)))

Prints:

True

edited Aug 20 '22 at 14:33

answered Aug 20 '22 at 11:33

JvdV

70,606
8
39
70

Padix Key · Answer 4 · 2022-08-24T12:02:19.877

Especically, if your subsequences have all the same length, you could try k-mer matching with Biotite, a package I am developer of. To allow mismatches, you can generate similar subsequences in the matching process using a SimilarityRule. In this example the rule and substitution matrix is set, so that all subsequences with up to MAX_MISMATCH mismatches are enumerated. Since the k-mer matching is implemented in C it should run quite fast.

import numpy as np
import biotite.sequence as seq
import biotite.sequence.align as align

K = 5
MAX_MISMATCH = 1


database_sequences = [
    seq.NucleotideSequence(seq_str) for seq_str in [
        "ATGCTGCTGA",
        # ...
    ]
]

query_sequences = [
    seq.NucleotideSequence(seq_str) for seq_str in [
        "ATGCC",
        # ...
    ]
]


kmer_table = align.KmerTable.from_sequences(K, database_sequences)

# The alphabet for the substitution matrix
# should not contain unambiguous symbols, such as 'N'
alphabet = seq.NucleotideSequence.alphabet_unamb
matrix_entries = np.full((len(alphabet), len(alphabet)), -1)
np.fill_diagonal(matrix_entries, 0)
matrix = align.SubstitutionMatrix(alphabet, alphabet, matrix_entries)
print(matrix)
print()

# The similarity rule will allow up to MAX_MISMATCH mismatches
similarity_rule = align.ScoreThresholdRule(matrix, -MAX_MISMATCH)
for i, sequence in enumerate(query_sequences):
    matches = kmer_table.match(sequence, similarity_rule)
    # Colums:
    # 0. Sequence position of match in query (first position of k-mer)
    # 1. Index of DB sequence in list
    # 2. Sequence position of match in DB sequence (first position of k-mer)
    print(matches)
    print()
    index_of_matches = np.unique(matches[:, 1])
    for j in index_of_matches:
        print(f"Match of subsequence {i} to sequence {j}")
        print()

Output:

    A   C   G   T
A   0  -1  -1  -1
C  -1   0  -1  -1
G  -1  -1   0  -1
T  -1  -1  -1   0

[[0 0 0]]

Match of subsequence 0 to sequence 0

If your subsequences have different lengths, but are all quite short (~ up length 10) you could create a KmerTable for each subsequence length (K = length) and match each table to the subsequences with the respective length. If your subsequences are much larger than that, this approach probably will not work due to the memory requirements of a KmerTable.

Fast python way to check if subsequence exists in a string and allow for 1-3 mismatches?

4 Answers4