4

I have been working on a program that needs to count sub-strings (up to 4000 sub-strings of 2-6 characters, stored in a list) inside a main string (~400,000 characters). I understand this is similar to the question asked at Counting substrings in a string; however, that solution does not work for me. Since my sub-strings are DNA sequences, many of them are repeats of a single character (e.g. 'AA'); therefore, if I split the string by 'AA', 'AAA' will be counted as a single instance of 'AA' rather than two. My current solution uses nested loops, but I'm hoping there is a faster way, as this code takes 5+ minutes for a single main string. Thanks in advance!

from itertools import product

def getKmers(self, kmer):
    # Build every possible k-mer of length `kmer` and start each count at zero.
    kmer_list = [''.join(t) for t in product('ACGT', repeat=kmer)]
    self.kmer_dict = {substr: 0 for substr in kmer_list}
    # Slide a window over the sequence; the + 1 includes the final window.
    for x in range(len(self.sequence) - kmer + 1):
        # Slow part: each window is compared against every k-mer in turn.
        for substr in kmer_list:
            if self.sequence[x:x+kmer] == substr:
                self.kmer_dict[substr] += 1
                break
    return self.kmer_dict
Chris_Rands
DanStu
  • Have you tried string.count()? It returns the number of (non-overlapping) occurrences of a substring. – Louis Jan 23 '19 at 21:42

2 Answers

8

For counting overlapping substrings of DNA, you can use Biopython:

>>> from Bio.Seq import Seq
>>> Seq('AAA').count_overlap('AA')
2

Disclaimer: I wrote this method, see commit 97709cc.

However, if you're looking for really high performance, Python probably isn't the right language choice (although an extension like Cython could help).
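As a point of comparison, here is a minimal pure-Python sketch of counting all overlapping k-mers in a single pass over the sequence, rather than comparing each window against every candidate. The function name count_kmers is hypothetical, and this is not Biopython's implementation; it only assumes the sequence is a plain str.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count every overlapping substring of length k in one pass."""
    # There are len(sequence) - k + 1 windows of length k.
    # Each window is hashed once, so no inner loop over candidates is needed.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

counts = count_kmers("AAACGT", 2)
print(counts["AA"])  # 2 (the two overlapping 'AA' windows)
print(counts["TT"])  # 0 (a Counter returns 0 for keys it never saw)
```

Because a Counter returns 0 for missing keys, there is no need to pre-fill a dictionary with all 4**k possible k-mers, although that can still be done if every key must be present in the output.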

Chris_Rands
  • Great, thanks for the suggestion! I hadn't looked into Biopython yet, but that's exactly what I was looking for. – DanStu Jan 24 '19 at 04:28
  • Actually, once you are using extensions such as BioPython, performance is alright - what you can do is the algorithmic kernel in pure Python. – jsbueno Jan 24 '19 at 14:11
  • @jsbueno I wrote the Biopython implementation, it is pure Python, decent and useful I hope- my point is this is slow compared to say C- obvious I know, but method developers doing serious large scale k-mer analyses like this normally wouldn't choose Python for this (but it depends on their exact use case of course) – Chris_Rands Jan 24 '19 at 14:17
  • @Chris_Rands your code is certainly useful - it gets me above 100kb/s (compared to roughly 100kb/min before) when using 6-mers, which is more than fast enough for my purposes since I'm not working with eukaryotic sequences. – DanStu Jan 24 '19 at 20:22
2

Of course Python is fully able to perform these string searches. But instead of re-inventing all the wheels you will need, one screw at a time, you would be better off using a more specialized tool inside Python to deal with your problem - it looks like the BioPython project is the most actively maintained and complete one for this sort of problem.

Short post with an example resembling your problem: https://dodona.ugent.be/nl/exercises/1377336647/

Link to the BioPython project documentation: https://biopython.org/wiki/Documentation

(If the problem were simply counting overlapping matches, the 3rd party "regex" module would be one way to go - https://pypi.org/project/regex/ - as the matching functions in Python's built-in re module return only non-overlapping matches unless the pattern is wrapped in a lookahead.)
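For what it's worth, overlapping counts can also be obtained with the standard library alone by wrapping the pattern in a zero-width lookahead, so that each match consumes no characters. The sketch below uses a hypothetical helper name, count_overlapping, and assumes the substring should be matched literally.

```python
import re

def count_overlapping(sequence, sub):
    """Count occurrences of sub in sequence, including overlapping ones."""
    # (?=...) is a zero-width lookahead: it matches at every position where
    # `sub` starts without consuming it, so overlapping hits are all found.
    pattern = re.compile("(?=" + re.escape(sub) + ")")
    return sum(1 for _ in pattern.finditer(sequence))

print(count_overlapping("AAA", "AA"))  # 2
print("AAA".count("AA"))               # 1 (str.count skips overlaps)
```

This is fine for a handful of substrings, but it rescans the whole sequence once per substring, so for thousands of k-mers a single sliding-window pass is likely to be cheaper.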

jsbueno