Especically, if your subsequences have all the same length, you could try k-mer matching with Biotite, a package I am developer of.
To allow mismatches, you can generate similar subsequences in the matching process using a SimilarityRule
. In this example the rule and substitution matrix is set, so that all subsequences with up to MAX_MISMATCH
mismatches are enumerated. Since the k-mer matching is implemented in C it should run quite fast.
import numpy as np
import biotite.sequence as seq
import biotite.sequence.align as align
K = 5
MAX_MISMATCH = 1
database_sequences = [
seq.NucleotideSequence(seq_str) for seq_str in [
"ATGCTGCTGA",
# ...
]
]
query_sequences = [
seq.NucleotideSequence(seq_str) for seq_str in [
"ATGCC",
# ...
]
]
kmer_table = align.KmerTable.from_sequences(K, database_sequences)
# The alphabet for the substitution matrix
# should not contain unambiguous symbols, such as 'N'
alphabet = seq.NucleotideSequence.alphabet_unamb
matrix_entries = np.full((len(alphabet), len(alphabet)), -1)
np.fill_diagonal(matrix_entries, 0)
matrix = align.SubstitutionMatrix(alphabet, alphabet, matrix_entries)
print(matrix)
print()
# The similarity rule will allow up to MAX_MISMATCH mismatches
similarity_rule = align.ScoreThresholdRule(matrix, -MAX_MISMATCH)
for i, sequence in enumerate(query_sequences):
matches = kmer_table.match(sequence, similarity_rule)
# Colums:
# 0. Sequence position of match in query (first position of k-mer)
# 1. Index of DB sequence in list
# 2. Sequence position of match in DB sequence (first position of k-mer)
print(matches)
print()
index_of_matches = np.unique(matches[:, 1])
for j in index_of_matches:
print(f"Match of subsequence {i} to sequence {j}")
print()
Output:
A C G T
A 0 -1 -1 -1
C -1 0 -1 -1
G -1 -1 0 -1
T -1 -1 -1 0
[[0 0 0]]
Match of subsequence 0 to sequence 0
If your subsequences have different lengths, but are all quite short (~ up length 10) you could create a KmerTable
for each subsequence length (K = length) and match each table to the subsequences with the respective length. If your subsequences are much larger than that, this approach probably will not work due to the memory requirements of a KmerTable
.