
I have already seen this answer to a similar question: https://stackoverflow.com/a/44311921/5881884

There, the Aho-Corasick algorithm is used to check whether each word in a list exists in a string in O(n) time. But I want to get the frequency of each word in a list in a string.

For example if

my_string = "some text yes text text some"
my_list = ["some", "text", "yes", "not"]

I would want the result:

[2, 3, 1, 0]

I did not find an exact example for this in the documentation, any idea how to accomplish this?

Other O(n) solutions than using ahocorasick would also be appreciated.

DevB2F

4 Answers


Implementation:

Here's an Aho-Corasick frequency counter:

import ahocorasick

def ac_frequency(needles, haystack):
    frequencies = [0] * len(needles)
    # Make a searcher
    searcher = ahocorasick.Automaton()
    for i, needle in enumerate(needles):
        searcher.add_word(needle, i)
    searcher.make_automaton()
    # Add up all frequencies
    for _, i in searcher.iter(haystack):
        frequencies[i] += 1
    return frequencies

(For your example, you'd call ac_frequency(my_list, my_string) to get the list of counts)

For medium-to-large inputs this will be substantially faster than other methods.

Notes:

For real data, this method will potentially yield different results than the other solutions posted, because Aho-Corasick looks for all occurrences of the target words, including occurrences inside longer words (substring matches).
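To make the substring caveat concrete, here is a small illustration (my own example, not part of the answer) using plain `re` scanning rather than Aho-Corasick. Counting "art" as a raw substring differs from counting it as a whole word:

```python
import re

text = "some art in a smart apartment"

# Raw substring occurrences (what Aho-Corasick on the bare word finds):
# "art" matches inside "art", "smart", and "apartment"
substring_count = len(re.findall("art", text))

# Whole-word occurrences only, using word boundaries
word_count = len(re.findall(r"\bart\b", text))

print(substring_count, word_count)  # 3 1
```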

If you want to find full words only, you can call searcher.add_word with space/punctuation-padded versions of each word:

    ...
    padding_start = [" ", "\n", "\t"]
    padding_end = [" ", ".", ";", ",", "-", "–", "—", "?", "!", "\n"]
    for i, needle in enumerate(needles):
        for s, e in [(s,e) for s in padding_start for e in padding_end]:
            searcher.add_word(s + needle + e, i)
    searcher.make_automaton()
    # Add up all frequencies
    for _, i in searcher.iter(" " + haystack + " "):
    ...
Ollin Boer Bohan
  • It almost works perfectly, but the suggestion in notes doesn't find words at beginning of a sentence. If I add: searcher.add_word(needle + " ", i) it will count the same instance twice. Is it not possible to use some regex to make sure it only finds the exact word? – DevB2F Jul 31 '18 at 18:44
  • I've updated the version in the notes to be a more complete solution for this use-case. It should catch words at the start/end of the string (by padding the haystack) and words immediately after/before line breaks. – Ollin Boer Bohan Jul 31 '18 at 19:47

The Counter in the collections module may be of use to you:

from collections import Counter

my_string = "some text yes text text some"
my_list = ["some", "text", "yes", "not"]

counter = Counter(my_string.split())
[counter.get(item, 0) for item in my_list]

# out: [2, 3, 1, 0]
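If the text contains punctuation, `split` alone won't isolate the words; a variant (my own tweak, not from the answer) tokenizes with a regex before counting:

```python
import re
from collections import Counter

my_string = "some text, yes; text text... some!"
my_list = ["some", "text", "yes", "not"]

# \w+ extracts runs of word characters, stripping the punctuation
counter = Counter(re.findall(r"\w+", my_string.lower()))
print([counter.get(item, 0) for item in my_list])  # [2, 3, 1, 0]
```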
dmmfll
  • what would the complexity be? – DevB2F Jul 31 '18 at 20:06
  • I was wondering the same. Am doing some investigation into it because I am not qualified to say. Counter is optimized. See this: https://stackoverflow.com/a/27802189/1913726 – dmmfll Aug 01 '18 at 12:31
  • I did some %%timeit tests and plotted the results for splitting strings. Splitting a string is O(n) according to those results. I am assuming that the lookup on a Counter instance is going to be O(1). – dmmfll Aug 01 '18 at 14:11
  • If your string is actually just a list of items (like in this case we are looking for full-word match, which makes the string a list of words), this is the best approach. – justhalf Apr 19 '21 at 05:35

You can use a list comprehension to count the number of times each word in my_list occurs in my_string:

[my_string.split().count(i) for i in my_list]
[2, 3, 1, 0]
Onyambu
  • This actually costs O(n*m) because the `count()` method itself costs O(n), and you're doing it for every item in `my_list`, which costs O(m). – blhsing Jul 31 '18 at 02:12
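To make that cost concrete (an illustrative sketch of my own): even if you split only once up front, each `count()` call rescans the entire token list, so the total work is still proportional to len(tokens) * len(my_list):

```python
my_string = "some text yes text text some"
my_list = ["some", "text", "yes", "not"]

tokens = my_string.split()  # one O(n) split, done once
# Each count() is another full O(n) scan, repeated m times: O(n*m) total
print([tokens.count(word) for word in my_list])  # [2, 3, 1, 0]
```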

You can use a dictionary to count the occurrences of the words you care about:

counts = dict.fromkeys(my_list, 0) # initialize the counting dict with all counts at zero

for word in my_string.split():
    if word in counts:     # this test filters out any unwanted words
        counts[word] += 1  # increment the count

The counts dict will hold the count of each word. If you really do need a list of counts in the same order as the original list of keywords (and the dict won't do), you can add a final step after the loop has finished:

results = [counts[word] for word in my_list]
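As a quick check, running the approach above on the question's example gives the expected result:

```python
my_string = "some text yes text text some"
my_list = ["some", "text", "yes", "not"]

counts = dict.fromkeys(my_list, 0)  # start every keyword at zero
for word in my_string.split():
    if word in counts:              # skip words we don't care about
        counts[word] += 1

print([counts[word] for word in my_list])  # [2, 3, 1, 0]
```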
Blckknght