4

I am curious what the most efficient (or most commonly used) algorithm is for counting the number of occurrences of a string in a chunk of text.

From what I have read, the Boyer–Moore string search algorithm is the standard for string searches, but I am not sure whether counting occurrences efficiently is the same problem as searching for a string.

In Python this is what I want:

text_chunck = "one two three four one five six one"
occurance_count(text_chunck, "one") # gives 3.

EDIT: It seems that Python's str.count serves as such a method; however, I have not been able to find out which algorithm it uses.
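
For reference, a quick sketch of how str.count behaves on the example above. The variable names are only illustrative. Note that the Python documentation describes str.count as returning the number of non-overlapping occurrences, which matters for inputs such as "aaaa" and "aa".

```python
# Illustrative check of str.count; the names below are not from the question.
text = "one two three four one five six one"
count = text.count("one")

# str.count counts non-overlapping occurrences, scanning left to right.
overlap_case = "aaaa".count("aa")

print(count, overlap_case)
```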

templatetypedef
Hellnar
  • If the string you're searching for is "aa" and your text is "aaaa", does that count as two or three occurrences? – tloflin May 04 '10 at 18:44
  • No, that wasn't a yes or no question: is it two or is it three? – tloflin May 04 '10 at 18:51
  • Oh, sorry. I will be counting occurrences of exact keywords in human-typed content, so it really doesn't matter much: overlapping matches will be very rare, and even when they happen, it is not critical. – Hellnar May 04 '10 at 18:55
  • Well, if it *really* doesn't matter, then Boyer-Moore (or any other published algorithm) is overkill. You can do it in O(n) (where n is the length of the text) with a naive rolling match; i.e. check the current text char against the current string char, if it's a match, advance both to the next char, otherwise advance text only and reset string to the first char. If you get to the end of the string, reset it and increment your occurrence count. This will only give a rough estimate (it won't find many edge cases), but you said that didn't matter. – tloflin May 04 '10 at 19:04
  • Also, there is almost certainly already a library for your chosen language that will do this for you. You should probably just go with that. Premature optimization and all that. – tloflin May 04 '10 at 19:07
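
The naive rolling match described in the comments above could look roughly like the following. This is a minimal sketch, and the function name `rough_count` is made up for illustration. As the comment warns, it is only an estimate: on a mismatch it resets the pattern without re-examining the text, so it can miss matches that begin inside a failed partial match.

```python
def rough_count(text, pattern):
    """Single-pass estimate of how often pattern occurs in text.

    Hypothetical helper following the comment's description; it is not
    an exact matcher.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    j = 0  # index of the next pattern character to match
    for ch in text:
        if ch == pattern[j]:
            j += 1
            if j == len(pattern):
                # Reached the end of the pattern: record it and start over.
                count += 1
                j = 0
        else:
            # Mismatch: advance the text only and reset the pattern.
            j = 0
    return count
```

On "one two three four one five six one" this finds all three occurrences of "one", but on the text "aaab" with the pattern "aab" it reports 0 even though there is one match.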

3 Answers

3

For starters, yes, you can accomplish this with Boyer-Moore very efficiently. However, depending on some other parameters of your problem, there might be a better solution.

The Aho-Corasick string matching algorithm will find all occurrences of a set of pattern strings in a target string and does so in time O(m + n + z), where m is the length of the string to search, n is the combined length of all the patterns to match, and z is the total number of matches produced. This is linear in the size of the source and target strings if you just have one string to match. It also will find overlapping occurrences of the same string. Moreover, if you want to check how many times a set of strings appears in some source string, you only need to make one call to the algorithm. On top of this, if the set of strings that you want to search for never changes, you can do the O(n) work as preprocessing time and then find all matches in O(m + z).
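
To make the Aho-Corasick idea concrete, here is a compact sketch that counts how often each pattern occurs, overlaps included. The names `build_automaton` and `count_patterns` are illustrative, the patterns are assumed to be non-empty and distinct, and this is not a tuned implementation.

```python
from collections import deque

def build_automaton(patterns):
    """Build the trie, failure links, and output lists (the O(n) preprocessing)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # indices of the patterns that end at each state
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)
    # Breadth-first pass: states at depth 1 keep the root as failure link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure state (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def count_patterns(text, patterns):
    """Return {pattern: number of occurrences in text}, counting overlaps."""
    goto, fail, out = build_automaton(patterns)
    counts = [0] * len(patterns)
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            counts[idx] += 1
    return dict(zip(patterns, counts))
```

A single call such as `count_patterns(text, ["one", "two"])` scans the text once for all patterns. If the pattern set never changes, the result of `build_automaton` could be built once and reused across texts.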

If, on the other hand, you have one source string and a rapidly changing set of substrings to search for, you may want to use a suffix tree. With O(m) preprocessing time on the string that you will be searching in, you can check how many times a particular substring of length n appears in that string in O(n) time per substring, provided each node of the tree stores the number of leaves below it.
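
A real suffix tree is too long to show here, so the sketch below swaps in a suffix array, a related index that supports the same "preprocess the text once, then count many substrings" workflow. It is not the suffix tree itself: the construction shown is deliberately naive and far slower than O(m), and each query costs O(n log m) rather than O(n). The function names are illustrative.

```python
def build_suffix_array(text):
    """Naive suffix array: start indices of all suffixes in sorted order.

    Sorting whole suffixes is slow; it only keeps the sketch short.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_with_suffix_array(text, suffix_array, pattern):
    """Count occurrences of pattern (overlaps included) by binary search."""
    n = len(pattern)

    def prefix(k):
        # First n characters of the k-th smallest suffix.
        start = suffix_array[k]
        return text[start:start + n]

    # Lower bound: first suffix whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - first
```

The array is built once per source text, after which each new substring needs only the two binary searches.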

Finally, if you're looking for something you can code up easily and with minimal hassle, you might want to consider the Rabin-Karp algorithm, which uses a rolling hash function to find strings. This can be coded up in roughly ten to fifteen lines of code, needs no preprocessing beyond hashing the pattern, and for normal text strings (lots of text with few matches) can find all matches very quickly.
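
As a rough illustration of the Rabin-Karp approach, the sketch below counts matches (overlaps included) with a polynomial rolling hash. The function name and the `base` and `mod` constants are arbitrary choices for this example, and every hash hit is verified by a direct comparison to rule out collisions.

```python
def rabin_karp_count(text, pattern, base=256, mod=1_000_000_007):
    """Count occurrences of pattern in text, overlaps included."""
    p_len, t_len = len(pattern), len(text)
    if p_len == 0 or p_len > t_len:
        return 0

    # Weight of the leading character in a window: base^(p_len - 1) mod mod.
    high = pow(base, p_len - 1, mod)

    # Hash the pattern and the first window of the text.
    p_hash = t_hash = 0
    for i in range(p_len):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod

    count = 0
    for start in range(t_len - p_len + 1):
        # Compare the strings only when the hashes agree.
        if p_hash == t_hash and text[start:start + p_len] == pattern:
            count += 1
        if start + p_len < t_len:
            # Slide the window: drop text[start], append text[start + p_len].
            t_hash = ((t_hash - ord(text[start]) * high) * base
                      + ord(text[start + p_len])) % mod
    return count
```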

Hope this helps!

templatetypedef
1

Boyer-Moore would be a good choice for counting occurrences, since its preprocessing overhead only needs to be paid once. However, it does better the longer the pattern string is, so for a pattern as short as "one" it would not be a good choice.

If you want to count overlaps, start the next search one character after the previous match. If you want to ignore overlaps, start the next search the full pattern string length after the previous match.

If your language has an indexOf or strpos method for finding one string in another, you can use that. If it proves too slow, then choose a better algorithm.
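
In Python, the counterpart of indexOf or strpos is str.find, so the two restart rules above can be sketched as follows. The function name `count_with_find` and its `overlapping` flag are made up for this example.

```python
def count_with_find(text, pattern, overlapping=False):
    """Count occurrences of pattern in text using repeated str.find calls."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    # Overlapping: resume one character after the match start.
    # Non-overlapping: resume a full pattern length after the match start.
    step = 1 if overlapping else len(pattern)
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + step)
    return count
```

With the text "aaaa" and the pattern "aa", this gives 3 when `overlapping=True` and 2 otherwise.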

drawnonward
-1

Hellnar, you can use a simple dictionary to count occurrences in a string. The algorithm is a counting algorithm; here is an example:

"""
The counting algorithm is used to count the occurrences of each character
in a string. This allows you to compare anagrams and strings themselves.
ex. animal, lamina a=2,n=1,i=1,m=1,l=1
"""

def count_occurences(text):
  # Map each character to the number of times it appears in text.
  occurences = {}
  for char in text:
    if char in occurences:
      occurences[char] = occurences[char] + 1
    else:
      occurences[char] = 1
  return occurences

def is_matched(s1, s2):
  # True if every character of s2 can be drawn from s1's character counts.
  matched = True
  s1_count_table = count_occurences(s1)

  for char in s2:
    if char in s1_count_table and s1_count_table[char] > 0:
      # Use up one copy of this character.
      s1_count_table[char] -= 1
    else:
      matched = False
      break
  return matched

# counting.is_matched("animal", "laminar")  # False: "animal" has no "r"

This example just returns True or False depending on whether the strings match. Keep in mind that this algorithm counts the number of times each character shows up in a string, which is good for anagrams.

newfront
  • This doesn't work correctly for the problem. First, it only reports true/false instead of a number of matches, which is what the OP was asking for. Second, if you were to search a large text corpus (say, the New York Times) for all occurrences of some string, then you would almost certainly get a false positive for any string, since your algorithm just checks whether the letters of the string occur somewhere in the source text. – templatetypedef Aug 26 '11 at 17:59