
I would like to know the complexity of the following algorithm and, most importantly, the step-by-step process that leads to deducing it.

I suspect it's O(length(text)^2*length(pattern)) but I have trouble solving the recurrence equation.

How would the complexity improve when doing memoization (i.e. dynamic programming) on the recursive calls?

Also, I would appreciate pointers to techniques/books that would help me learn how to analyze this kind of algorithm.

In Python:

def count_matches(text, pattern):
  if len(pattern) == 0: return 1

  result = 0
  for i in xrange(len(text)):
    if (text[i] == pattern[0]):
      # repeat the operation with the remaining string and pattern
      result += count_matches(text[i+1:], pattern[1:])

  return result

In C:

int count_matches(const char text[],    int text_size, 
                  const char pattern[], int pattern_size) {

  if (pattern_size == 0) return 1;

  int result = 0;

  for (int i = 0; i < text_size; i++) {
    if (text[i] == pattern[0])
      /* repeat the operation with the remaining string and pattern */
      result += count_matches(text+i+1, text_size-(i+1), 
                              pattern+1, pattern_size-1);
  }

  return result;  
}

Note: The algorithm intentionally repeats the matching for every substring. Please don't focus on what kind of matching the algorithm is performing, just on its complexity.

Apologies for the (now fixed) typos in the algorithms

Rndm
fons
  • I think both of the examples must be *wrong*, and they're not identical – Antti Haapala -- Слава Україні Mar 14 '15 at 10:26
  • The python version has some typos: there is a `tex` variable at line 2 and a `count()` call at line 8. Also, the python version fails if `pattern` is shorter than `text`. If you are just looking for a string comparison algorithm, you can probably achieve that without recursion. – user2464424 Mar 14 '15 at 10:42
  • @user2464424 Thanks, I corrected it (I started with the C algorithm and added the python one to widen the audience) – fons Mar 14 '15 at 10:48
  • @AnttiHaapala How are they wrong and non-identical? – fons Mar 14 '15 at 10:49
  • `if (text_size == 0) return 1;` shouldn't that be `if (pattern_size == 0) return 1;` ?? – joop Mar 17 '15 at 11:39
  • Yes, I will correct it, my bad. – fons Mar 17 '15 at 11:42
  • The other easy return would be `if (text_len < pattern_len) return 0;` , obviously. – joop Mar 17 '15 at 11:47
  • `if (text[0] == pattern[0])` This condition is loop invariant. Maybe you meant something like `if (text[i] == pattern[0])` ??? (plus some other, related changes) – joop Mar 17 '15 at 12:50
  • **1.** You have `if (text[i] == pattern[0]):` condition in Python and `if (text[0] == pattern[0])` in C — seems inconsistent. **2.** In C recurrence you shift the starting position of *both* strings, `text` and `pattern`, by `i`, although you compared the zeroth character of `pattern` to the `i`-th char of `text` — seems mismatched. **3.** In C recurrence you shift the starting position of both strings by `i` *but* you shorten them by `(i+1)` — inconsistency again. – CiaPan Mar 17 '15 at 13:12
  • You are right, I fixed the typo, sorry – fons Mar 17 '15 at 14:20
  • 2
    In the recursive call, should the parameters in C version `text+i, text_size-(i+1), pattern+i, pattern_size-(i+1)` be `text+i+1, text_size-(i+1), pattern+1, pattern_size-1`, according to Python version? The first, third and fourth parameters seem wrong O_O ... – Gassa Mar 17 '15 at 19:24

5 Answers


My intuition that the complexity is O(length(text)^3) is incorrect. It is actually O(n!) purely because the implementation is of form

def do_something(relevant_length):
    # base case

    for i in range(relevant_length):
        # some constant time work

        do_something(relevant_length - 1)

as discussed in Example of O(n!)?

If memoization is used, the recursion tree is produced once and then subsequently looked up every time after.

Picture the shape of the recursion tree.

We make progress one character per layer. There are 2 base cases. The recursion bottoms out when we reach the end of pattern OR if there are no longer any characters in text through which to iterate. The first base case is explicit but the second base case just occurs given the implementation.

So the depth (height) of the recursion tree is min[length(text), length(pattern)].

How many subproblems? We also make progress one character per layer. If all characters in text were compared, using the Gauss trick for summing S = [n(n+1)] / 2, the total number of subproblems that will ever be evaluated, across all recursion layers, is {length(text) * [length(text) + 1]} / 2.

Take length(text) = 6 and length(pattern) = 10, where length(text) < length(pattern). The depth is min[length(text), length(pattern)] = 6.

PTTTTT
PTTTT
PTTT
PTT
PT
P

What if length(text) = 10 and length(pattern) = 6, where length(text) > length(pattern)? The depth is min[length(text), length(pattern)] = 6.

PTTTTTTTTT
PTTTTTTTT
PTTTTTTT
PTTTTTT
PTTTTT
PTTTT

What we see is that the length(pattern) doesn't really contribute to complexity analysis. In cases that length(pattern) < length(text), we're just hacking off a bit of the Gauss sum.

But, because text and pattern step forward together one for one, we end up doing much less work. The recursion tree looks like the diagonal of a square matrix.

For length(text) = 6 and length(pattern) = 10 as well as for length(text) = 10 and length(pattern) = 6, the tree is

P
 P
  P
   P
    P
     P

Hence, the complexity of the memoized approach is

O( min( length(text), length(pattern) ) )

Edit: Given @fons's comment, what if the recursion is never triggered? Specifically, the case where text[i] == pattern[0] is never true for any i. Then iterating through all of text is the dominating factor, even if length(text) > length(pattern).

So that implies the actual upper bound of the memoized approach is

O( max( length(text), length(pattern) ) )

Thinking about it a bit more, in the case when length(text) > length(pattern) and recursion IS triggered, even when pattern is exhausted, it takes constant time to recurse and check that pattern is now empty, so length(text) still dominates.

This makes the upper bound of the memoized version O(length(text)).

chiaboy
  • I understand the reasoning about the complexity of the original algorithm, and I think I agree. But I don't think I agree about the memoized version. For instance, for length(text) = t, length(pattern) = p, with t > p, and no match of p in t, the complexity is O(length(text)) – fons Mar 18 '15 at 12:35
  • @fons, looks like you're right. Assuming you instead meant "no match of t in p," because text[i] == pattern[0] for all i is never true, the recursion is actually always skipped. So it appears max instead of min is the upper bound for all cases. – chiaboy Mar 18 '15 at 12:56
  • Added an edit to the original answer after thinking about it a bit more to even eliminate the max of the two lengths. – chiaboy Mar 18 '15 at 13:11

Ehm... I could be wrong but as far as I see, your runtime should be focused on this loop:

for c in text:
    if (c == pattern[0]):
      # repeat the operation with the remaining string and pattern
      result += count_matches(text[1:], pattern[1:])

Basically, let the length of your text be n; we don't need the length of the pattern.

The first time this loop is run (in the parent function) we will have n calls to it. Each of those n calls will in the worst case call n-1 instances of your program. Then those n-1 instances will in the worst case call n-2 instances and so on.

This results in an equation that is going to be n*(n-1)*(n-2)*...*1, which is n!. So your worst case runtime is O(n!). Pretty bad (:

I ran your Python program several times with input that would cause the worst case runtime:

In [21]: count_matches("aaaaaaa", "aaaaaaa")

Out[21]: 5040

In [22]: count_matches("aaaaaaaa", "aaaaaaaa")

Out[22]: 40320

In [23]: count_matches("aaaaaaaaa", "aaaaaaaaa")

Out[23]: 362880

The last input is 9 symbols and 9! = 362880.

To analyze the runtime of your algorithm you need to first think of the input that causes the worst possible runtime. In your algorithm best and worst vary quite a bit so you probably need average case analysis but that is quite complicated. (You would need to define what input is average and how often worst case would be seen.)

Dynamic programming can help alleviate your runtime quite a bit, but analysis is harder. Let's first code a simple unoptimized dynamic programming version:

cache = {}
def count_matches_dyn(text, pattern):
  if len(pattern) == 0: return 1

  result = 0
  for c in text:
    if (c == pattern[0]):
      # repeat the operation with the remaining string and pattern
      if (text[1:], pattern[1:]) not in cache:
        cache[(text[1:], pattern[1:])] = count_matches_dyn(text[1:], pattern[1:])
      result += cache[(text[1:], pattern[1:])]

  return result

Here we cache all calls to count_matches in a dictionary, so when we call count_matches with the same input we will get the cached result instead of calling the function again. (This is known as memoization.)

Now let's analyze it. The main loop

  for c in text:
    if (c == pattern[0]):
      # repeat the operation with the remaining string and pattern
      if (text[1:], pattern[1:]) not in cache:
        cache[(text[1:], pattern[1:])] = count_matches_dyn(text[1:], pattern[1:])
      result += cache[(text[1:], pattern[1:])]

Will run n times on the first call (our cache is empty). However, the first recursive call will populate the cache:

cache[(text[1:], pattern[1:])] = count_matches_dyn(text[1:], pattern[1:])

And every other call in the same loop will cost O(1). So basically the top level recursion will cost O(n-1) + (n-1)*O(1) = O(n-1) + O(n-1) = 2*O(n-1). You can see that, of the calls further down the recursion, only the first one will descend with many recursive calls (the O(n-1) call) and the rest will cost O(1) because they are just dictionary lookups. Given all that was said, the runtime is 2*O(n-1), which amortizes to O(n).

Disclaimer. I am not entirely sure about the analysis of the dynamic programming version, please feel free to correct me (:

Disclaimer 2. The dynamic programming code contains expensive operations (text[1:], pattern[1:]) which are not factored in the analysis. This is done on purpose because in any reasonable implementation you can drastically reduce the cost of those calls. The point is to show how simple caching can drastically reduce runtime.
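As a sketch of that point, one way to avoid the repeated slicing is to memoize on (i, j) index pairs instead of string suffixes. The function name count_matches_memo and the nested helper are illustrative, not from the original post, and the snippet is written for Python 3:

```python
def count_matches_memo(text, pattern):
    cache = {}  # maps (i, j) -> number of matches of pattern[j:] in text[i:]

    def go(i, j):
        if j == len(pattern):
            return 1  # empty pattern matches exactly once
        if (i, j) not in cache:
            result = 0
            for k in range(i, len(text)):
                if text[k] == pattern[j]:
                    result += go(k + 1, j + 1)
            cache[(i, j)] = result
        return cache[(i, j)]

    return go(0, 0)
```

Since there are at most len(text) * len(pattern) distinct (i, j) pairs, each computed once with a linear scan, the slicing cost disappears entirely and only the dictionary lookups remain.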

XapaJIaMnu
  • Big-O notation doesn't imply worst-case - it's just a way of describing the asymptotic behaviour of a function. It's just that, when analysing algorithmic complexity, the worst-case runtime is often what we're most interested in. – psmears Mar 17 '15 at 14:59
  • @psmears true, my bad. I'll fix it. – XapaJIaMnu Mar 17 '15 at 15:14
  • not quite true. `text[1:]`, `pattern[1:]` — these two operations lead to an actual complexity of `O(n^2)`. And you keep creating them; that would be slow indeed. – Jason Hu Mar 17 '15 at 15:31
  • @HuStmpHrrr this is true, but I have done only a very simplified caching. In any reasonable implementation, those string operations would be much more optimized (in c, you could know the exact size beforehand and use an array copy). I believe the main point of the speedup offered by dynamic programming still stands. – XapaJIaMnu Mar 17 '15 at 15:39
  • agreed. I just want to stress that discussing complexity and the actual implementation of an algorithm in Python are different things. Some minor operation in Python may bring huge overhead while being hard to notice. – Jason Hu Mar 17 '15 at 15:42
  • Why the downvotes?! When I wrote the reply the algorithm was different, and the author had stated not to fix his algorithm but to analyze it as it is... – XapaJIaMnu Mar 17 '15 at 23:54
  • First, let us rise above the code and formulate the problem this code is trying to solve.

The Python version seems to count the number of occurrences of pattern as a subsequence of text. The C version currently looks broken, so I'll assume below that the Python version is right.

  • Then, look back at the code and note some general things about how the solution is carried out.

The function calculates the answer by adding up 0s and 1s. Thus the number of operations is at least the number of 1s one needs to add up to get the answer, that is, the answer itself.

  • Now, let us devise an input (text, pattern) which will give the worst possible runtime for given lengths of text and pattern.

The largest answer is clearly some case where all letters are equal.

  • After that, we use the above simplification of input and some knowledge of mathematics to calculate the answer directly.

When all letters are equal, the answer is essentially the number of ways to choose k = len (pattern) items (letters) out of n = len (text), which is choose (n, k).

  • Next, we pick lengths of text and pattern which give us the worst possible complexity.

By example: for text = 'a' * 100 and pattern = 'a' * 50, we have the answer choose (100, 50) = 100! / 50! / 50!. Generally, for a fixed length of text, the length of pattern must be half of that, rounded either side if necessary. It's an intuitive notion one gets when looking at Pascal's triangle. Formally, this is trivial to prove by comparing choose (n, k) and choose (n, k+-1) by hand.
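That claim is easy to spot-check numerically: re-using the (fixed) Python version from the question and comparing against math.comb (written for Python 3.8+):

```python
from math import comb

def count_matches(text, pattern):
    if len(pattern) == 0:
        return 1
    result = 0
    for i in range(len(text)):
        if text[i] == pattern[0]:
            result += count_matches(text[i+1:], pattern[1:])
    return result

# with all letters equal, the answer is exactly choose(n, k)
for n, k in [(6, 3), (8, 4), (10, 5)]:
    assert count_matches('a' * n, 'a' * k) == comb(n, k)
```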

  • Estimate the answer we got.

The sum choose (n, 0) + choose (n, 1) + ... + choose (n, n) is 2^n, and intuitively again, choose (n, n/2) is a considerable fraction of that. More formally, by Stirling's formula, it turns out choose (n, n/2) is on the order of 2^n divided by sqrt(n).
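The Stirling estimate can also be sanity-checked numerically (math.comb assumed, Python 3.8+): the ratio of choose(n, n/2) to 2^n / sqrt(n) settles near the constant sqrt(2/pi) ≈ 0.8 as n grows:

```python
from math import comb, sqrt

# ratio of choose(n, n/2) to 2^n / sqrt(n); Stirling predicts it
# approaches sqrt(2/pi) ~ 0.798 as n grows
for n in (10, 20, 40, 80):
    print(n, comb(n, n // 2) / (2 ** n / sqrt(n)))
```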

  • Finally, note that more detailed analysis is probably unnecessary.

When the complexity is exponential, we usually are less interested in precise polynomial factors. Say, 2^100 (O(2^n)) and 100 times 2^100 (O(n * 2^n)) operations are equally impossible to complete in reasonable time. What would matter is to reduce O(2^n) to O(2^(n/2)), or better, to find a polynomial solution.

  • Recall that what we found is a lower bound.

Actually, the complexity would indeed be choose(len(text), len(pattern)) multiplied by some polynomial, if we add the following line at the top:

if len(text) < len(pattern): return 0

Indeed, there can be no match if the number of letters left in the text is less than the length of the pattern. Without that line, we can have a larger number of recursion branches which ultimately result in adding 0 to the answer.

  • Here is a view from another angle.

By looking from another side, we can prove that the number of operations in the unaltered code can be as high as 2 to the power of len(text).

Indeed, when text = 'a' * n and pattern = 'a' * n, suppose we already processed k letters of text. Each of these letters, independently from others, could have been either matched with some letter of pattern or left out in the loop. So, we have two ways to go for each letter of text, and so 2^n ways to go when we processed n letters of text, that is, arrived at a terminating call of our recursive function.
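A quick way to check this bound, assuming the fixed Python version from the question: add a call counter (the counter-passing style below is just for illustration) and observe that for text = pattern = 'a' * n, the function is entered exactly 2^n times — one call per subset of matched text positions:

```python
def count_matches_counting(text, pattern, counter):
    counter[0] += 1  # one call per subset of matched positions so far
    if len(pattern) == 0:
        return 1
    result = 0
    for i in range(len(text)):
        if text[i] == pattern[0]:
            result += count_matches_counting(text[i+1:], pattern[1:], counter)
    return result

for n in range(1, 12):
    counter = [0]
    count_matches_counting('a' * n, 'a' * n, counter)
    assert counter[0] == 2 ** n
```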

Gassa

The time complexity should improve to something of the order of O(length(text) * length(pattern)), down from the plain recursive version's O(n!).

The memoized solution (DP) would involve building a lookup table of text-vs-pattern positions, which can be filled in incrementally starting from the end of the text and the pattern.
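As a sketch of that idea (the function and table names are illustrative, not from the answer, and the snippet is written for Python 3), the table can be filled from the ends of both strings; dp[i][j] holds the number of matches of pattern[j:] inside text[i:]:

```python
def count_matches_dp(text, pattern):
    n, m = len(text), len(pattern)
    # dp[i][j] = number of matches of pattern[j:] inside text[i:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][m] = 1  # an empty pattern matches exactly once
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            dp[i][j] = dp[i + 1][j]           # skip text[i]
            if text[i] == pattern[j]:
                dp[i][j] += dp[i + 1][j + 1]  # match text[i] to pattern[j]
    return dp[0][0]
```

Both loops are bounded, so this runs in O(length(text) * length(pattern)) time and space.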

sray

I'm afraid your algorithm is incorrect for pattern matching, mainly because once it finds that a first character matches, it will search for the remaining sub-pattern anywhere in the rest of the text. For example, for the text "abbccc" and a pattern "accc", your algorithm will return a result equal to 1.

You should consider implementing the "Naive" Algorithm for pattern matching, which is very similar to what you were trying to do, but without recursion. Its complexity is O(n*m) where 'n' is the text length, and 'm' is the pattern length. In Python you could use the following implementation:

text = "aaaaabbbbcccccaaabbbcccc"
pattern = "aabb"
result = 0

index = text.find(pattern)
while index > -1:
    result += 1
    print index
    index = text.find(pattern, index+1)

print result

Regarding books on the subject, my best recommendation is Cormen's "Introduction to Algorithms", which covers all the material on algorithms and complexity.

Slava Bronfman
  • The algorithm is intentionally like that, it's not wrong. For the text "abbccc" and a pattern "accc", the algorithm **is expected** to return a result equal to 1. I will remove the "pattern matching" part from the question title to make it less confusing. – fons Mar 14 '15 at 10:51