Running Time Complexity of my Algorithm - how do I compute this and further optimize the algorithm?

Question

I designed a recursive algorithm and wrote it in Python. When I measure the running time with different parameters, it seems to take exponential time. It runs for more than half an hour even with small inputs such as 50. (I didn't wait for it to finish, but it doesn't seem to finish in a reasonable amount of time, so I guess it's exponential.)

So, I'm curious about the running time complexity of this algorithm. Can someone please help me derive the equation T(n,m), or compute its big-O bound?

The algorithm is below:

# parameters:
# search string, the index where we left on the search string, source string, index where we left on the source string,
# and the indexes array, which keeps track of the indexes found for the characters
def find(search, searchIndex, source, sourceIndex, indexes):
    found = None
    if searchIndex < len(search): # if we haven't reached the end of the search string yet
        found = False
        while sourceIndex < len(source): # loop thru the source, from where we left off
            if search[searchIndex] == source[sourceIndex]: # if there is a character match
                # recursively look for the next character of search string 
                # to see if it can be found in the remaining part of the source string
                if find(search, searchIndex + 1, source, sourceIndex + 1, indexes):
                    # we have found it
                    found = True # set found = true
                    # if an index for the character in search string has never been found before.
                    # i.e if this is the first time we are finding a place for that current character
                    if indexes[searchIndex] is None:
                        indexes[searchIndex] = sourceIndex # set the index where a match is found
                    # otherwise, if an index has been set before but it's different from what
                    # we are trying to set right now. so that character can be at multiple places.
                    elif indexes[searchIndex] != sourceIndex: 
                        indexes[searchIndex] = -1 # then set it to -1.
            # increment sourceIndex at each iteration so as to look for the remaining part of the source string. 
            sourceIndex = sourceIndex + 1
    return found if found is not None else True

def theCards(N, colors):
    # allcards: a list of N characters where allcards[i - 1] is 'R' if i is a prime number, 'B' otherwise (i = 1..N).
    # so in this example where N=7, allcards=['B','R','R','B','R','B','R']
    allcards = ['R' if isPrime(i) else 'B' for i in range(1, N + 1)]
    # indexes is initially None.
    indexes = [None] * len(colors)

    find(colors, 0, allcards, 0, indexes)
    return indexes    

if __name__ == "__main__":
    print(theCards(7, list("BBB")))
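
The listing above calls isPrime, which the post does not show. A minimal trial-division sketch, assuming it is meant to return True exactly for the primes (so 1 counts as non-prime, matching the N=7 example), could look like this:

```python
def isPrime(n):
    """Hypothetical helper (not shown in the post): trial division up to sqrt(n)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

# Cards 1..7 coloured as in the example: 'R' for primes, 'B' otherwise.
allcards = ['R' if isPrime(i) else 'B' for i in range(1, 8)]
```

Trial division is more than fast enough for the card counts discussed here; a sieve would only matter for much larger N.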

I don't know if one has to understand the problem and the algorithm in order to derive the worst-case running time, but here is the problem I attempted to solve:

The Problem:

Given a source string SRC and a search string SEA, find the subsequence SEA in SRC and return the indexes of where each character of SEA was found in SRC. If a character in SEA can be at multiple places in SRC, return -1 for that character's position.

For instance, if the source string is BRRBRBR (N=7) and the search string is BBB, then the first 'B' in 'BBB' can appear at index 0 of the source string. The second 'B' can be at index 3 of the source string and the last 'B' can be at index 5. There are no other alternatives for the positions of the characters of 'BBB', and thus the algorithm returns [0,3,5].

In another case, where the source string is BRRBRB (N=6) and the search string is RBR, the first 'R' of 'RBR' can be at position 1 or 2. This leaves only position 3 for 'B' and position 4 for the last 'R'. So the first 'R' can be at multiple places and its place is ambiguous, while the other two characters, B and R, have only one place each. The algorithm therefore returns [-1,3,4] (0-based positions, as in the previous example).

The case where the algorithm takes forever and doesn't finish is when the source string is ['B', 'R', 'R', 'B', 'R', 'B', 'R', 'B', 'B', 'B', 'R', 'B', 'R', 'B', 'B', 'B', 'R', 'B', 'R', 'B', 'B', 'B', 'R', 'B', 'B', 'B', 'B', 'B', 'R', 'B', 'R', 'B', 'B', 'B', 'B', 'B', 'R', 'B', 'B', 'B', 'R', 'B', 'R', 'B', 'B', 'B', 'R', 'B', 'B', 'B', 'B', 'B', 'R', 'B', 'B', 'B', 'B', 'B'] (N=58) and the search string is RBRRBRBBRBRRBBRRBBBRRBBBRR. It should return [-1, -1, -1, -1, -1, -1, -1, -1, 17, 18, 19, 23, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 47, 53] (these positions appear to be 1-based card numbers), but unfortunately it doesn't =(

Optimizations:

I thought of halting the search when the 'indexes' list is completely full of -1s. But that only helps in the best case (or maybe the average case), not the worst case. How can one further optimize this algorithm? I know that there exists a polynomial solution to this problem.

More important than the optimizations, I'm really curious about the T(n,m) equation of the running time, where n and m are the lengths of the source and search strings.

If you were able to read until here, thank you very much! =)

EDIT - IVlad's solution implemented:

def find2(search, source):
    indexes = list()
    last = 0
    for ch in search:
        if last >= len(source):
            break
        while last < len(source) and source[last] != ch:
            last = last + 1
        indexes.append(last)
        last = last + 1
    return indexes

def theCards(N, colors):
    # allcards: a list of N characters where allcards[i - 1] is 'R' if i is a prime number, 'B' otherwise (i = 1..N).
    allcards = ['R' if isPrime(i) else 'B' for i in range(1, N + 1)]

    indexes = find2(colors, allcards) # find the indexes of the first occurrences of the characters
    colors.reverse() # now reverse both strings
    allcards.reverse()
    # and find the indexes of the first occurrences of the characters, again, but in reversed order
    indexesreversed = find2(colors, allcards)
    indexesreversed.reverse() # reverse back the resulting list of indexes 
    indexesreversed = [len(allcards) - i - 1 for i in indexesreversed] # fix the indices

    # return -1 where the forward and reversed runs disagree; otherwise report the position 1-based
    return [indexes[i] + 1 if indexes[i] == indexesreversed[i] else -1 for i in range(len(indexes))]

if __name__ == "__main__":
    print(theCards(495, list("RBRRBRBBRBRRBBRRBBBRRBBBRR")))

I'm not sure I understand the algorithm yet, but it looks like it should be O(n^2). — Gabe, Jan 28 '11 at 14:37
@Murat- You "know that there exists a polynomial solution to this problem." I'm not disagreeing, just curious: how do you know? — Justin, Jan 28 '11 at 14:38
You should accept |V|ad's answer, it is the correct one. I misunderstood exactly what you were trying to do because of your code and explanation, so the answer I provided is for a slightly more complex problem. — Jesus Ramos, Jan 28 '11 at 16:35
Yeah, I know, but I didn't have a chance to give much thought on it. Now I'm home, I implemented his solution, posted it as an edit to my original post. And thank you for your contribution ^^ — Murat Derya Özen, Jan 28 '11 at 18:39
Yeah the wording on these problems can be kind of tricky and at first glance it looked like LCS to me. I guess even after years of algorithms some stuff still gets by me. Thanks again for catching that |V|ad. — Jesus Ramos, Jan 28 '11 at 19:15

IVlad · Accepted Answer · 2011-01-28T20:24:19.273

4

You can do it in O(n + m), where m is the number of characters in SEA, and n the number of characters in SRC:

last = 1
for i = 1 to m do
    while SRC[last] != SEA[i]
        ++last

    print last
    ++last (skip this match)

Basically, for each character in SEA, find its position in SRC, but only scan after the position where you found the previous character.

For instance, if the source string is BRRBRBR (N=7) and the search string is BBB:

Find B in SRC: found at last = 1, print 1, set last = 2.

Find B in SRC: found at last = 4, print 4, set last = 5.

Find B in SRC: found at last = 6, print 6, set last = 7. Done.
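
The pseudocode and walkthrough above can be sketched in Python as follows. This is only an illustration: it uses 0-based indexes instead of the 1-based ones in the walkthrough, and it adds a bounds check that the pseudocode leaves out. The name leftmost_positions is made up for this sketch.

```python
def leftmost_positions(src, sea):
    """Greedy left-to-right scan: 0-based position in src of each character of sea.

    Returns None if sea is not a subsequence of src, a case the
    pseudocode above does not handle.
    """
    positions = []
    last = 0
    for ch in sea:
        # advance to the next occurrence of ch, staying inside src
        while last < len(src) and src[last] != ch:
            last += 1
        if last == len(src):
            return None
        positions.append(last)
        last += 1  # skip this match
    return positions

# 0-based [0, 3, 5], i.e. positions 1, 4, 6 in the 1-based walkthrough
result = leftmost_positions("BRRBRBR", "BBB")
```

The variable last only ever moves forward, so the scan touches each character of SRC at most once, which is where the O(n + m) bound comes from.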

As for the complexity of your algorithm, I'm not able to provide a very formal analysis, but I'll try to explain how I'd go about it.

Assume that all characters are equal, both within SRC and SEA and between them. Then the condition inside your while loop is always true, so we can ignore it. Also note that your while loop executes up to n times.

Note that for the first character you will call find(1, 1), ... find(m, n). But these will also start their own while loops and make their own recursive calls. Each find(i, j) makes one recursive call per iteration of its while loop, and those calls in turn make more recursive calls themselves, resulting in a sort of "avalanche effect" that causes exponential complexity.

So you have:

i = 1: calls find(2, 2), find(3, 3), ..., find(m, n)
       find(2, 2) calls find(3, 3), ..., find(m, n)
       find(3, 3) calls find(4, 4), ..., find(m, n)
       find(4, 4) calls find(5, 5), ..., find(m, n)
       ...
       total calls: O(m^m)
i = 2: same, but start from find(2, 3).
...
i = n: same

Total complexity thus looks like O(n*m^m). I hope this makes sense and I haven't made any mistakes.
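
As a rough cross-check of the exponential growth, the call structure can be simulated for the same all-equal case assumed above. This is a sketch of that model only, not a formal derivation: count_calls and binomial are illustrative names, and the simulation ignores the bookkeeping on the indexes list. In this model every call at recursion depth k corresponds to one choice of k increasing positions in SRC, so the total comes out as the sum of C(n, k) for k = 0..min(n, m), which equals 2^n once m >= n.

```python
from math import factorial

def count_calls(n, m, search_index=0, source_index=0):
    """Number of find() invocations when every character matches (0-based)."""
    if search_index == m:
        return 1  # end of the search string: the call returns immediately
    total = 1  # this call itself
    for k in range(source_index, n):
        # every source position matches, so every iteration recurses
        total += count_calls(n, m, search_index + 1, k + 1)
    return total

def binomial(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

def closed_form(n, m):
    return sum(binomial(n, k) for k in range(min(n, m) + 1))

small = count_calls(10, 3)   # 1 + 10 + 45 + 120
full = count_calls(10, 10)   # 2 ** 10
```

Inputs with mixed characters, like the ones in the question, trigger fewer calls than this, but the count still grows far faster than any polynomial when many embeddings of SEA exist.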

edited Jan 28 '11 at 20:24

answered Jan 28 '11 at 15:49

IVlad, thanks for the answer but there's one thing I don't get. The first 'B' is found at last=1, and the other two B's are found at the remaining part of SRC. OK, but I want to find more than one occurrence of the subsequence. Say SRC was BRRBRBRB (one more 'B' added). Then I would like the algorithm to say that "'B' can be at multiple places. Similarly, the second and the third 'B' can also be at multiple places." and return [-1,-1,-1]. So, you would have to not only check the first occurrences, but perhaps all occurrences to be sure. How can we change your algorithm to adapt this situation? – Murat Derya Özen Jan 28 '11 at 16:01
@Murat - sorry, I haven't read your question properly it seems. However, this should be easy to handle. Reverse `SRC` and `SEA` and run the same algorithm again. Print -1 for the positions where you get different values between the two runs. This should work because the algorithm always finds the first possible match, going from left to right. So if you reverse the strings and get different matches, you know where there are multiple possibilities. For example, run on the reversed strings you'd get `1, 3, 5`. Accounting for the inversions this translates easily to `4 6 8` => `-1 -1 -1`. – IVlad Jan 28 '11 at 16:10
I think this is a decent approach, IVlad; it takes linear time as you mentioned. Thank you very much. I'm posting a Python implementation of the algorithm you proposed. It doesn't even take a second to complete with N=58 and a search string longer than 20 characters! – Murat Derya Özen Jan 28 '11 at 18:32
By the way, do you happen to know how can I derive the T(n,m) equation or compute the big-oh complexity of the method I first wrote (the code above)? Just out of curiosity... – Murat Derya Özen Jan 28 '11 at 18:42
@Murat - I attempted to do so. I hope it makes sense. – IVlad Jan 28 '11 at 20:25
@IVlad - Your analysis seems correct, at least it explains the exponential growth. Thanks a lot for the help IVlad! Best, – Murat Derya Özen Jan 28 '11 at 21:49
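
The reverse-and-compare idea from the comments above can be sketched as follows. This is a minimal illustration with 0-based positions; leftmost and unique_positions are made-up names. A character's position is treated as fixed only when the leftmost and rightmost embeddings of SEA agree on it.

```python
def leftmost(src, sea):
    """0-based positions of the leftmost embedding of sea in src, or None."""
    positions, last = [], 0
    for ch in sea:
        while last < len(src) and src[last] != ch:
            last += 1
        if last == len(src):
            return None
        positions.append(last)
        last += 1
    return positions

def unique_positions(src, sea):
    """Position of each character of sea in src, or -1 where it is ambiguous."""
    left = leftmost(src, sea)
    if left is None:
        return None  # sea is not a subsequence of src
    n = len(src)
    # Rightmost embedding: scan the reversed strings, then map indexes back.
    right = [n - 1 - p for p in leftmost(src[::-1], sea[::-1])][::-1]
    return [l if l == r else -1 for l, r in zip(left, right)]

fixed = unique_positions("BRRBRBR", "BBB")       # every position is forced
ambiguous = unique_positions("BRRBRBRB", "BBB")  # one extra 'B' at the end
```

Both scans are linear, so the whole check stays in O(n + m) time.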

score 3 · Answer 2 · answered Jan 28 '11 at 14:49

This is simply the longest common subsequence problem. It can be solved with dynamic programming in far less than exponential time. In your case, when LCS returns the length of SEA, you know that the sequence SEA exists in SRC; saving the indexes is a small modification to the algorithm. Here's a link to a good explanation: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
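
A minimal sketch of the LCS-based existence check described here, assuming the standard dynamic programming recurrence and keeping only two rows of the table (a common space optimization). The names lcs_length and is_subsequence are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b, in O(len(a) * len(b))."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for x in a:
        curr = [0]  # DP row for the current character of a
        for j, y in enumerate(b):
            if x == y:
                curr.append(prev[j] + 1)
            else:
                curr.append(max(prev[j + 1], curr[j]))
        prev = curr
    return prev[len(b)]

def is_subsequence(src, sea):
    # sea occurs as a subsequence of src exactly when the LCS is all of sea
    return lcs_length(src, sea) == len(sea)

found = is_subsequence("BRRBRBR", "BBB")
```

With only two rows the matched indexes cannot be read back from the table, so recovering them would need the full table or extra bookkeeping.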

Jesus Ramos

@Murat: your current algorithm will run in O(2^n) (if I'm not mistaken or misunderstanding your code) time which is unacceptable in contest problem solving environments. LCS runs in O(n * m) which is the length of the first string times the length of the 2nd or O(n^2) worst case when both strings are of equal length. One further optimization is to realize that you only need 2 rows of the DP table to solve the problem instead of the whole DP matrix – Jesus Ramos Jan 28 '11 at 15:11
This is overkill. There's no need to use LCS when this has nothing to do with the concept of "longest" in any way. – IVlad Jan 28 '11 at 15:50
@|V|ad: You're correct, I was reading the code and assumed a little too much in this case it's solvable in O(n) where n is the length of the longer string. I will edit my answer to reflect this, thanks again for pointing this out. – Jesus Ramos Jan 28 '11 at 16:23

Nick Dandoulakis · Answer 3 · 2011-01-28T16:34:36.973

0

From a quick look, you're searching, recursing and backtracking?

I believe creating a suffix array for your source string would be a good idea.
Constructing the suffix array has O(n log n) complexity. Locating a substring of length m with binary search then takes O(m log n) time.
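
A small sketch of what a suffix-array lookup looks like, under loud assumptions: the construction below simply sorts the suffixes, which is much slower than the O(n log n) constructions referred to above, and the function names are made up for illustration. It searches for contiguous substrings only.

```python
def build_suffix_array(s):
    """Naive construction: sort suffix start positions by suffix text."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_substring(s, sa, pattern):
    """Return one start index where pattern occurs in s, or -1 if it does not."""
    lo, hi = 0, len(sa)
    # binary search for the first suffix whose prefix is >= pattern
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and s[sa[lo]:sa[lo] + len(pattern)] == pattern:
        return sa[lo]
    return -1

s = "BRRBRBR"
sa = build_suffix_array(s)
hit = find_substring(s, sa, "RBR")   # a start index of the substring RBR
miss = find_substring(s, sa, "BBB")  # -1: BBB is not a contiguous substring
```

BBB is a subsequence of BRRBRBR but not a substring of it, so this kind of lookup reports -1 for the question's first example.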

edited Jan 28 '11 at 16:34

answered Jan 28 '11 at 14:52

But then how can I make use of the suffix array I created? – Murat Derya Özen Jan 28 '11 at 14:56
@Murat: you can find a match with binary search. Once you locate a match, you can scan forward and backward from that position (sequentially) to match more substrings. – Nick Dandoulakis Jan 28 '11 at 14:59
Even more overkill. Also, you are talking about substrings when the OP wants subsequence. The two are not the same. -1. – IVlad Jan 28 '11 at 15:51
@IVlad: it's not overkill in case we're going to search the same string for many substrings. But you're right. The OP wants subsequence. Oops. – Nick Dandoulakis Jan 28 '11 at 16:43
