finding the most amount of times a substring appears successively in a string

Question

I have a long string of characters and not only am I trying to find if a substring of those characters exists within the larger string, I'm also trying to find the longest run of successive instances.

For example... in the code snippet below I've found that I can use "count" to see how many times the substring b appears in a. That result is 5. However, what I'm trying to identify is the longest successive run, which would be 3 (where 'abc' appears back to back to back in the middle). I'm having difficulty running through the logic of this one. Any advice would be appreciated.

a = "abcxyzabcabcabcxyzabcxyz"

b = "abc"

total = a.count(b)

print(total)

score 1 · Accepted Answer · answered Oct 25 '19 at 02:51


This should be fairly simple with a while loop:

def func(a, b): 
    n = 1 
    while b*n in a: 
        n += 1 
    return n - 1
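
As a quick sanity check, here is a sketch of the same loop run against the string from the question. The name longest_repeat and the guard for an empty b are assumptions added for illustration: without the guard, '' * n is always found in a and the loop would never end.

```python
def longest_repeat(a, b):
    """Return the largest n such that b repeated n times occurs in a."""
    if not b:
        # Assumption: treat an empty substring as having no run,
        # because '' * n is always "in" a and the loop would not terminate.
        return 0
    n = 1
    while b * n in a:
        n += 1
    return n - 1


print(longest_repeat("abcxyzabcabcabcxyzabcxyz", "abc"))  # 3
print(longest_repeat("xyz", "abc"))  # 0
```

Each pass builds a longer candidate string and searches a again, which is the repeated work the comments below refer to.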

wim

Probably perfectly fast for any reasonable string, but quite inefficient algorithmically. +1 – Mad Physicist Oct 25 '19 at 03:14
It is possible that if the string is really long, this solution would run out of memory or run very slowly due to the number of repeated searches performed though – ashiswin Oct 25 '19 at 03:14
If this runs out of memory then you've already used up half your memory with the string `a`. **Downvoter:** substring checks are highly optimized in Python, so in practice I'd be surprised to see a faster version than this, despite the algorithmic inefficiency (but happy to be proved wrong). – wim Oct 25 '19 at 03:48

score 0 · Answer 2 · answered Oct 25 '19 at 02:46

One possible, naive solution is to use Python's str.index method to find the next occurrence of the substring. From there, keep checking ahead for back-to-back copies of the substring until it no longer appears, then call index again to skip ahead to the next occurrence.

Example:

a = "abcxyzabcabcabcxyzabcxyz"
b = "abc"

curr_index = a.index(b)
longest_count = 0
current_count = 0

while curr_index < len(a):
    if a[curr_index : curr_index + len(b)] == b:
        curr_index += len(b)
        current_count += 1
    else:
        if longest_count < current_count:
            longest_count = current_count
        try:
            curr_index = a.index(b, curr_index)
        except ValueError:
            # Substring no longer found in string slice
            break
        current_count = 0

if longest_count < current_count:
    longest_count = current_count

print(longest_count)

This just returns the longest repeating count, but it doesn't return the location where it starts. Adding that functionality, however, is trivial.
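
One way the start location could be tracked is sketched below. This is not the code from the answer: the function name longest_run_with_start and the (0, -1) return value for "not found" are assumptions chosen for illustration, and str.find is used in place of str.index so that no exception handling is needed.

```python
def longest_run_with_start(a, b):
    """Return (count, start_index) for the longest back-to-back run of b in a.

    Returns (0, -1) when b never occurs (an illustrative convention).
    """
    if not b:
        return 0, -1
    best_count, best_start = 0, -1
    pos = a.find(b)
    while pos != -1:
        run_start = pos
        count = 0
        # Consume consecutive copies of b starting at pos.
        while a.startswith(b, pos):
            count += 1
            pos += len(b)
        if count > best_count:
            best_count, best_start = count, run_start
        # Jump to the next occurrence at or after the end of this run.
        pos = a.find(b, pos)
    return best_count, best_start


print(longest_run_with_start("abcxyzabcabcabcxyzabcxyz", "abc"))  # (3, 6)
```

Like the loop above, this scans greedily from left to right, so for a substring that can overlap itself (for example "aba" in "ababaaba") it may report a shorter run than actually exists.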

You can keep using index for your search – Mad Physicist Oct 25 '19 at 03:12

score 0 · Answer 3 · edited Oct 25 '19 at 03:53

Keep calling a.index(b, start) with an advancing start index. If the returned index equals start, the match begins exactly where the previous one ended, so you're still in the same run. Otherwise, start a new run:

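def longest_run(string, pattern):
    longest = 0
    current = 0
    start = 0  # position just past the previous match
    while True:
        try:
            ind = string.index(pattern, start)
        except ValueError:
            # No more matches: the final run may be the longest one
            return max(longest, current)
        if ind == start:
            # Match begins right where the last one ended: same run
            current += 1
        else:
            # Gap before this match: close the old run, start a new one
            longest = max(longest, current)
            current = 1
        # Resume after this match, not a fixed step from the old start
        start = ind + len(pattern)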

score 0 · Answer 4 · answered Oct 25 '19 at 05:00

You can use re.findall with a pattern that matches one or more consecutive repetitions of b (use re.escape so that b is not interpreted as regex syntax), then map the returned strings to len and pass them to max to get the length of the longest match, and finally divide that length by the length of b to get the number of times b is repeated:

import re
# default=0 avoids a ValueError from max() when b does not occur in a at all
max(map(len, re.findall('(?:%s)+' % re.escape(b), a)), default=0) // len(b)
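
To show why re.escape matters here, the sketch below wraps the same expression in a function and runs it on a substring containing a regex metacharacter. The function name longest_run_regex and the empty-substring guard are assumptions added for illustration.

```python
import re


def longest_run_regex(a, b):
    """Count the longest run of back-to-back copies of b in a using a regex."""
    if not b:
        return 0  # assumption: an empty substring has no meaningful run
    # re.escape makes characters such as '.' or '+' in b match literally.
    pattern = "(?:%s)+" % re.escape(b)
    # default=0 covers the case where b does not occur in a at all.
    return max(map(len, re.findall(pattern, a)), default=0) // len(b)


print(longest_run_regex("abcxyzabcabcabcxyzabcxyz", "abc"))  # 3
# Without re.escape, '.' would also match the 'b' in "abc" and give 3 here.
print(longest_run_regex("a.ca.cabc", "a.c"))  # 2
print(longest_run_regex("xyz", "abc"))  # 0
```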
