The text of Alice in Wonderland contains the word 'Wonderland' 8 times. (Let's be case-insensitive for this question).

However, it contains the word many more times if you count non-contiguous subsequences as well as substrings, e.g.:

Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to WONDER what was going to happen next. First, she tried to Look down AND make out what she was coming to, but it was too dark to see anything;

(A subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements. —Wikipedia)

How many times does the book contain the word Wonderland as a subsequence? I expect this will be a big number—it's a long book with many w's and o's and n's and d's.

I tried brute force counting (recursion to make a loop 10 deep) but it was too slow, even for that example paragraph.
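
To make the definition concrete, here is a minimal sketch of that kind of brute force, written with `itertools.combinations` instead of ten nested loops. The function name `brute_force_count` is made up for this illustration. It tries every strictly increasing tuple of positions, so the number of candidates is "n choose 10" for a ten-letter word, which is why it only finishes on very short inputs.

```python
from itertools import combinations

def brute_force_count(text, word):
    """Count the position tuples i1 < i2 < ... that spell `word` in `text` (case-insensitive)."""
    text, word = text.lower(), word.lower()
    target = tuple(word)
    total = 0
    # Every strictly increasing tuple of len(word) positions is one candidate subsequence.
    for positions in combinations(range(len(text)), len(word)):
        if tuple(text[i] for i in positions) == target:
            total += 1
    return total

print(brute_force_count("wawo", "wo"))  # prints 2
```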

Colonel Panic

3 Answers

Let's say you didn't want to search for wonderland, but just for w. Then you'd simply count how many times w occurred in the story.

Now let's say you want wo. For each first character of the current pattern you find, you add to your count:

  1. How many times the current pattern without its first character occurs in the rest of the story, after this character you're at: so you have reduced the problem (story[1..n], pattern[1..n]) to (story[2..n], pattern[2..n])

  2. How many times the entire current pattern occurs in the rest of the story. So you have reduced the problem to (story[2..n], pattern[1..n])

Now you can just add the two. There is no overcounting if we talk in terms of subproblems. Consider the example wawo. Obviously, wo occurs 2 times. You might think the counting will go like:

  1. For the first w, add 1 because o occurs once after it and another 1 because wo occurs once after it.

  2. For the second w, add 1 because o occurs once after it.

  3. Answer is 3, which is wrong.

But this is what actually happens:

(wawo, wo) -> (awo, o) -> (wo, o) -> (o, o) -> (-, -) -> 1
                                            -> (-, o) -> 0
           -> (awo, wo) -> (wo, wo) -> (o, wo) -> (-, wo) -> 0
                                    -> (o, o) -> (-, -) -> 1
                                              -> (-, o) -> 0

So you can see that the answer is 2.

If you don't find a w, then the count for this position is just how many times wo occurs after this current character.

This allows for dynamic programming with memoization:

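def count(story_index, pattern_index, dp):
  # story and pattern are assumed to be globals; dp is a dict used as the memo table
  if pattern_index == len(pattern):
    return 1  # the whole pattern has been matched
  if story_index == len(story):
    return 0  # the story ran out before the pattern was matched

  if (story_index, pattern_index) not in dp:
    if story[story_index] == pattern[pattern_index]:
      # either use this story character for the pattern, or skip it
      dp[story_index, pattern_index] = (count(story_index + 1, pattern_index + 1, dp) +
                                        count(story_index + 1, pattern_index, dp))
    else:
      dp[story_index, pattern_index] = count(story_index + 1, pattern_index, dp)

  return dp[story_index, pattern_index]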

Call with count(0, 0, dp). Note that you can make the code cleaner (remove the duplicate function call).
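
One possible way to do that cleanup, as a sketch rather than the definitive version: let `functools.lru_cache` play the role of `dp` and make the "skip this character" call only once. The names `count_memo` and `go` are invented for this example. The recursion is still as deep as the story is long, so this suits short inputs; raising the recursion limit, as done here, is unlikely to be enough for a whole book.

```python
import sys
from functools import lru_cache

def count_memo(story, pattern):
    """Number of ways `pattern` occurs in `story` as a subsequence."""
    # The recursion can go len(story) levels deep, so raise the limit for moderately long inputs.
    sys.setrecursionlimit(max(sys.getrecursionlimit(), len(story) + 100))

    @lru_cache(maxsize=None)
    def go(i, j):
        if j == len(pattern):
            return 1  # whole pattern matched
        if i == len(story):
            return 0  # story exhausted first
        total = go(i + 1, j)  # skip story[i]; this call is needed in both cases
        if story[i] == pattern[j]:
            total += go(i + 1, j + 1)  # use story[i] for pattern[j]
        return total

    return go(0, 0)

print(count_memo('wawo', 'wo'))  # prints 2
```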

Python code, with no memoization:

def count(story, pattern):
  if len(pattern) == 0:
    return 1
  if len(story) == 0:
    return 0

  s = count(story[1:], pattern)
  if story[0] == pattern[0]:
    s += count(story[1:], pattern[1:])

  return s

print(count('wonderlandwonderland', 'wonderland'))

Output:

17

This makes sense. For each i from 1 to 9, you can take the first i characters from the first wonderland of the story and the remaining 10 - i characters from the end of the second wonderland, giving you 9 solutions. Another 2 are the words themselves. The other six are:

wonderlandwonderland
*********    *
********    **
********    *      *
**      **    ******
***      *    ******
**      *    *******

You're right that this will be a huge number. I suggest that you either use large integers or take the result modulo something.
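
Python integers are arbitrary precision, so the plain count works there as is. For languages with fixed-width integers, here is a hedged sketch of the modulo option. It computes the same counts bottom-up with a one-dimensional table instead of recursion; the function name `count_mod` and the modulus 1,000,000,007 are assumptions picked only for illustration.

```python
MOD = 1_000_000_007  # assumed modulus, chosen only for illustration

def count_mod(story, pattern, mod=MOD):
    """Number of subsequence occurrences of `pattern` in `story`, reduced modulo `mod`."""
    # ways[j] = number of ways to match the first j pattern characters so far (mod `mod`)
    ways = [0] * (len(pattern) + 1)
    ways[0] = 1  # the empty prefix is matched in exactly one way
    for ch in story:
        # Go from long prefixes to short ones so a story character
        # is never used twice within the same match.
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == ch:
                ways[j] = (ways[j] + ways[j - 1]) % mod
    return ways[len(pattern)]

print(count_mod('wonderlandwonderland', 'wonderland'))  # prints 17
```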

The same program returns 9624 for your example paragraph.

IVlad
  • I don't understand number 2. Aren't you double counting those "wo"s? Each such "wo" will be counted as part of number 1. for the later "w", no? – Edward Doolittle Jul 24 '15 at 17:00
  • @EdwardDoolittle I don't really understand what you mean. Number 1 counts, for each `w`, how many `o`s are after it, forming a `wo`. It also counts how many `wo` are after it (so also considering the next `w`s). Yes, the `w` in 1 will be counted as making a `wo` with an `o` in future `wo`s counted by 2, but they are distinct, so valid, at least according to my understanding of the problem. – IVlad Jul 24 '15 at 17:07
  • I read your algorithm as "for each 'w' you find { count how many times 'o' appears in the rest of the story, and how many times 'wo' appears in the rest of the story }". I can see what you mean is a different way of bracketing those statements ... I think? – Edward Doolittle Jul 28 '15 at 18:30
  • @EdwardDoolittle I realize my explanation was a bit confusing because it did not mention that the pattern we're looking for also changes. It's not for each `w`, it's for each first character of the current pattern (my implementation does not have this bug). Added more explanations. Please let me know if this is clearer. – IVlad Jul 28 '15 at 19:27

The string "wonderland" occurs as a subsequence in Alice in Wonderland (see note 1 below) 24100772180603281661684131458232 times.

The main idea is to scan the main text character by character, keeping a running count of how often each prefix of the target string (in this case, "w", "wo", "won", ..., "wonderlan", and "wonderland") has occurred as a subsequence up to the current letter. These running counts are easy to compute and update. If the current letter does not occur in "wonderland", then the counts are left untouched. If the current letter is "a", then we increment the count of "wonderla"s seen by the number of "wonderl"s seen up to this point. If the current letter is "n", then we increment the count of "won"s by the count of "wo"s and the count of "wonderlan"s by the count of "wonderla"s. And so forth. When we reach the end of the text, we will have the counts of all prefixes of "wonderland", including the string "wonderland" itself, as desired.

The advantage of this approach is that it requires a single pass through the text and does not require O(n) recursive calls (which will likely exceed the maximum recursion depth unless you do something clever).

Code

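import fileinput

target = 'wonderland'

# prefixes[letter] lists every prefix of target that ends in that letter
prefixes = dict()
# count[prefix] is how often that prefix has occurred as a subsequence so far
count = dict()

for i in range(len(target)):
    letter = target[i]
    prefix = target[:i+1]
    if letter not in prefixes:
        prefixes[letter] = [prefix]
    else:
        prefixes[letter].append(prefix)
    count[prefix] = 0  # Python 3 integers are arbitrary precision

for line in fileinput.input():
    for letter in line.lower():
        if letter in prefixes:
            # Longest prefix first, so a count updated for this letter is never
            # reused for the same letter (matters for targets with doubled letters)
            for prefix in reversed(prefixes[letter]):
                if len(prefix) > 1:
                    count[prefix] = count[prefix] + count[prefix[:-1]]
                else:
                    count[prefix] = count[prefix] + 1

print(count[target])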

  1. Using this text from Project Gutenberg, starting with "CHAPTER I. Down the Rabbit-Hole" and ending with "THE END".
mhum

Following up on previous comments, if you are looking for an algorithm that would return 2 for the input wonderlandwonderland and 1 for wonderwonderland, then I think you could adapt the algorithm from this question:

How to find smallest substring which contains all characters from a given string?

Effectively, the change in your case would be that, once an instance of the word is found, you increment a counter and repeat the whole procedure on the remaining part of the text.

Such an algorithm would be O(n) in time, where n is the length of the text, and O(m) in space, where m is the length of the searched string.
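
As a rough illustration of that "find one instance, count it, continue with the rest" idea, here is a minimal sketch. It is a plain left-to-right greedy scan, not the smallest-substring algorithm from the linked question, and the function name `count_disjoint` is made up for this example.

```python
def count_disjoint(text, word):
    """Count complete, non-overlapping matches of `word` found while scanning left to right."""
    text, word = text.lower(), word.lower()
    if not word:
        return 0
    found = 0
    j = 0  # index of the next character of `word` we are waiting for
    for ch in text:
        if ch == word[j]:
            j += 1
            if j == len(word):
                # A full instance was found: count it and start over on the rest of the text.
                found += 1
                j = 0
    return found

print(count_disjoint('wonderlandwonderland', 'wonderland'))  # prints 2
print(count_disjoint('wonderwonderland', 'wonderland'))      # prints 1
```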

mdm