Fastest way to count instances of substrings in string Python3.6

Question

I have been working on a program that needs to count sub-strings (up to 4,000 sub-strings of 2-6 characters each, stored in a list) inside a main string (~400,000 characters). I understand this is similar to the question asked at Counting substrings in a string; however, that solution does not work for me. Since my sub-strings are DNA sequences, many of them are repeats of a single character (e.g. 'AA'), so if I split the string by 'AA', 'AAA' is interpreted as a single instance of 'AA' rather than two. My current solution uses nested loops, but I'm hoping there is a faster way, as this code takes 5+ minutes for a single main string. Thanks in advance!

from itertools import product

def getKmers(self, kmer):
    # Build every possible k-mer over the DNA alphabet (4**kmer strings).
    kmer_list = [''.join(p) for p in product('ACGT', repeat=kmer)]
    self.kmer_dict = {k: 0 for k in kmer_list}
    # Slide a window of length kmer over the sequence; the + 1 includes the last window.
    for x in range(len(self.sequence) - kmer + 1):
        # Compare the window against each candidate k-mer (the slow nested loop).
        for substr in kmer_list:
            if self.sequence[x:x+kmer] == substr:
                self.kmer_dict[substr] += 1
                break
    return self.kmer_dict

Have you tried string.count()? It returns the number of (non-overlapping) occurrences of a substring. – Louis Jan 23 '19 at 21:42
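To illustrate why the non-overlapping behaviour matters for the 'AAA' case in the question, here is a minimal sketch comparing str.count with a loop over str.find. The name count_overlapping is a hypothetical helper introduced for this example, not a standard library function.

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing overlaps (hypothetical helper)."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Resume one character after the last hit so overlapping matches are seen.
        start = text.find(sub, start + 1)
    return count

print('AAA'.count('AA'))               # 1: str.count skips overlapping matches
print(count_overlapping('AAA', 'AA'))  # 2
```

Calling this once per sub-string still scans the main string once for each of them, so it mainly shows the counting semantics and is not necessarily the fastest approach.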

score 8 · Accepted Answer · answered Jan 23 '19 at 21:49

For counting overlapping substrings of DNA, you can use Biopython:

>>> from Bio.Seq import Seq
>>> Seq('AAA').count_overlap('AA')
2

Disclaimer: I wrote this method, see commit 97709cc.

However, if you're looking for really high performance, Python probably isn't the right language choice (although an extension like Cython could help).
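If staying in pure Python, one possible alternative to searching for each k-mer separately is to walk the sequence once and tally every window of length k. The sketch below is not the Biopython method; it is a different technique built on collections.Counter, and count_kmers is a hypothetical name chosen for this example.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count every overlapping k-mer in sequence in one pass (hypothetical helper)."""
    # Each start position contributes exactly one window of length k.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

counts = count_kmers('AAACGT', 2)
print(counts['AA'])  # 2
print(counts['CG'])  # 1
print(counts['TT'])  # 0: a Counter returns 0 for k-mers it never saw
```

This does one dictionary update per position instead of comparing each window against every candidate k-mer, which should reduce the work considerably for a list of several thousand sub-strings; actual timings will depend on the data.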

Chris_Rands


Great, thanks for the suggestion! I hadn't looked into Biopython yet, but that's exactly what I was looking for. – DanStu Jan 24 '19 at 04:28
Actually, once you are using extensions such as BioPython, performance is alright - what you can do is the algorithmic kernel in pure Python. – jsbueno Jan 24 '19 at 14:11
@jsbueno I wrote the Biopython implementation; it is pure Python, and decent and useful, I hope. My point is that it is slow compared to, say, C. That is obvious, I know, but method developers doing serious large-scale k-mer analyses like this normally wouldn't choose Python (though it depends on their exact use case, of course). – Chris_Rands Jan 24 '19 at 14:17
@Chris_Rands your code is certainly useful - it gets me above 100kb/s (compared to roughly 100kb/min before) when using 6-mers, which is more than fast enough for my purposes since I'm not working with eukaryotic sequences. – DanStu Jan 24 '19 at 20:22

score 2 · Answer 2 · answered Jan 23 '19 at 21:42

Of course Python is fully able to perform these string searches. But instead of re-inventing all the wheels you will need, one screw at a time, you would be better off using a more specialized tool inside Python to deal with your problem. The BioPython project looks like the most actively maintained and complete option for this sort of problem.

Short post with an example resembling your problem: https://dodona.ugent.be/nl/exercises/1377336647/

Link to the BioPython project documentation: https://biopython.org/wiki/Documentation

(If the problem were simply counting overlapping matches, the third-party "regex" module would be one way to go, https://pypi.org/project/regex/, as the built-in regex engine in Python's re module doesn't return overlapping matches either.)
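Although re does not report overlapping matches directly, a zero-width lookahead is one commonly used workaround that stays within the standard library. The sketch below assumes that approach; count_overlapping_re is a hypothetical helper name, and the third-party regex module is not used here.

```python
import re

def count_overlapping_re(text, sub):
    """Count overlapping occurrences of sub using a zero-width lookahead (hypothetical helper)."""
    # re.escape guards against characters that have special meaning in a pattern.
    pattern = '(?=' + re.escape(sub) + ')'
    # A lookahead consumes no characters, so the scan advances one position at a time.
    return len(re.findall(pattern, text))

print(count_overlapping_re('AAA', 'AA'))  # 2
```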
