18

I read this article Retiring a Great Interview Problem, the author came up with a "word break" problem and gave three solutions. The efficient one uses memoization algorithm and the author said its worst case time complexity is O(n^2) since "the key insight is that SegmentString is only called on suffixes of the original input string, and that there are only O(n) suffixes".

However, I find it difficult for me to understand why it is O(n^2). Could someone please give me a hint or proof?

Word Break Problem:

    Given an input string and a dictionary of words,
    segment the input string into a space-separated
    sequence of dictionary words if possible. For
    example, if the input string is "applepie" and
    dictionary contains a standard set of English words,
    then we would return the string "apple pie" as output.

The memoization algorithm from Retiring a Great Interview Problem

Map<String, String> memoized;

String SegmentString(String input, Set<String> dict) {
      if (dict.contains(input)) 
          return input;
      if (memoized.containsKey(input) {
          return memoized.get(input);
      }
      int len = input.length();
      for (int i = 1; i < len; i++) {
          String prefix = input.substring(0, i);
          if (dict.contains(prefix)) {
              String suffix = input.substring(i, len);
              String segSuffix = SegmentString(suffix, dict);
              if (segSuffix != null) {
                  return prefix + " " + segSuffix;
              }
          }
      }
      memoized.put(input, null);
      return null;
}
Will Ness
  • 70,110
  • 9
  • 98
  • 181
mitchelllc
  • 1,607
  • 4
  • 20
  • 24
  • The solution is not optimal, it only update the memoized map when the result of string input is null, should update in all cases! – Pham Trung Jan 22 '14 at 04:06
  • 2
    It's correct. Note that `segSuffix` is only called recusively if `prefix` is in a word in `dict`. If that recursive call returns non-`null`, the caller then immediately returns as well, on up the stack. In other words, as soon as a `segSuffix` call returns a value other than `null`, the algorithm never calls `segSuffix` again, so the non-`null` value doesn't need to be memoized. – Alp Jan 22 '14 at 04:11
  • @Alp Nice, thanks. In this case, only an array of boolean will be enough I guess,as we can always refer to the suffix as its index in the whole input string. – Pham Trung Jan 22 '14 at 04:22
  • Could you guys help me explain why it is `O(n^2)`? – mitchelllc Jan 22 '14 at 04:47

4 Answers4

13

Intuition

(it is clear that the worst case is the one where there are no solutions, so I focus on that)

Since the recursive call is made before putting values in the memoization-cache, the last (shortest) suffixes will get cached first. This is because the function is first called with a string of length N, which then calls itself with a string of length N-1, which then .... , with a string of len 0, which is cached and returns, then the string of length 1 is cached and returns, ..., length N is cached and returns.

As the hint suggests, only suffixes get cached, and there are only N of them. This means that by the time the top-level function gets the result of its first recursive call (i.e. on a suffix of length N-1), the cache is already populated with the N-1 suffixes.

Now, assuming the N-1 last suffixes are already cached, the for-loop needs to make N-1 recursive calls, each taking O(1) (since the answer is already cached), totalling O(N). However, (pre-) building the cache of the last N-1 takes O(N^2) (explained below), totalling with O(N)+O(N^2) = O(N^2).


Proof by mathematical induction

This explanation can easily be translated to a formal proof using induction. Here is the gist of it:

(f(N) being the number of operations required for the function to complete on an input of length N)

Induction hypothesis -- there exists a constant c s.t. f(N) < c*N^2.

The base case is trivial -- for a string of length 1, you can find a constant c such that f(1) < c*N^2 = c

Induction step

Recap of the order things happen:

Step 1: the function first calls itself on the suffix of length N-1, building a cache containing the answer for the last N-1 suffixes

Step 2: the function then calls itself O(N) more times, each taking O(1) (thanks to this test: if (memoized.containsKey(input)), and the fact that the cache has already been populated in Step 1).

So we get:

f(N+1) = f(N) + a*N <   (by the hypothesis)
       < c*N^2 + a*N <  (for c large enough)
       < c*N^2 + 2*c*N + c =
       = c*(N+1)^2

Thus we got f(N+1) < c*(N+1)^2, which completes the proof.

shx2
  • 61,779
  • 13
  • 130
  • 153
3

In General, the dimensions of the memoized table decide the complexity.

For a memoized table of only 1 dimension (memoized[n]) -> takes O(n) complexity, For a memoized table of only 2 dimensions (memoized[n][n]) -> takes O(n^2) complexity and so on.

Reason: In case of memoization, worst case complexity is the running time of the inputs for which none of its cases (overlapping sub-problems) are not cached (pre-calculated) yet.

Now, lets say the memoization table has 2 dimensions (memoization[n][n]). Worst case complexity can only be the maximum of dimensions of the memoization table. Hence, its worst case can be only reach upto O(n^2).

Thus the dimensions of the Memoization table basically decides the worst case time complexity.

Answer of @shx2 explains this mathematically.

1

At first, for a given string with length N, we can break it into N*(N-1)/2 segments, and check whether each segment is contained in the dictionary. This cost is O(N^2)

Come back to the your code, begin with the string N, we break it down into two smaller strings, length 'a', and 'N - a'. And for each sub string (or prefix) start from 0 and end at 'a', we only check it once!

From each segment N - a, it also check each of its prefix once and store it into the memorized table, so, this step will make sure that the next time, when we made the same move, split the string at this exactly place, we only need to return the result, without further work (With assumption that a map will retrieve and return the result for a particular string in O(1)). This storing and retrieving step also makes sure that the conquer and dividing is only done once.

So, after first and second point, we conclude that there is only N*(N - 1)/2 segment to be examined at only once time, which lead to the fact that the cost is O(N^2)

Note: Only with the assumption for cost of both dict.contains(input) and memoized.containsKey(input) are O(1) so the complexity is O(N^2).

Pham Trung
  • 11,204
  • 2
  • 24
  • 43
  • Could you explain why we can break it into `N*(N-1)/2 segments`? I think it should be `2^N`, since for each letter in the input string, it may or may not contained in the segment. – mitchelllc Jan 23 '14 at 03:12
  • @mitchelllc to make a segment, we need one start and one end point, so at index `i`, the end points of segment starting from it is `i+1`, `i+2` ,... So, index 0 can have n segments, index 1 have n-1 segments ... And the formula for sum from n to 1 is n*(n-1)/2 – Pham Trung Jan 23 '14 at 03:43
  • well, I am still confused. As the author in the article explained, `which reduces the analysis to determine the number of possible segmentations. I leave it as an exercise to the reader (with this hint) to determine that this number is O(2^n).` (see `General solution` in http://thenoisychannel.com/2011/08/08/retiring-a-great-interview-problem/) – mitchelllc Jan 23 '14 at 04:12
  • 1
    @mitchelllc Because the map has stored all the possibilities which help you to avoid redoing any step, and there are O(N^2) cases in total, each is only calculated once, So we have the conclusion. Math is not work here as a recursive formula is not enough, in some cases, it takes into account some previous steps (for example split the string at index 1 , 4, 5, return and then split again at index 2,4,5. Two later splits are recalculated), thus make the result of the formula higher than it should be. Try to remove the map and you can see how the index is traversed. – Pham Trung Jan 23 '14 at 04:51
  • @mitchelllc After a reread for shx2 answer, I think his answer is correct :) The key point is focus on the first step of the function – Pham Trung Jan 23 '14 at 05:30
-1

If the input is of length N, the then maximum number of dictionary words this could contain would be N. so u need to check all different combinations of length 1..N. which is N(N-1)/2

take a input string of, aaaaa

there are N x 'a' strings

N-1 'aa' strings

and so on

Pita
  • 1,444
  • 1
  • 19
  • 29