
I am using Aho-Corasick to perform some string searches on documents. The original code uses a NumPy array to store, in an efficient way, which strings from a search list were matched:

import ahocorasick
import numpy as np

def ahocorasickFind(search_list, text):
    # Build the automaton, storing (index, word) as each word's payload
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()

    # Collect the search-list index of every match found in the text
    index_list = []
    for item in A.iter(text):
        index_list.append(item[1][0])

    # Indicator list: 1 at position i if search_list[i] was found
    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1
    return output_list.tolist()

search_list = ['joão', 'maria', 'thiago']  # thousands of words in the real code
result = ahocorasickFind(search_list, 'asdasdasd joão 1231231 thiago')  # huge text in the real code
for index, element in enumerate(result):
    if element == 1:
        print(search_list[index])

The above approach takes too much time and memory to iterate over and test (`if element == 1`). So, how can I get the "original" strings found in the input text in a performant way?
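
For reference, here is a minimal sketch of what I mean, collecting the matched strings directly from the automaton instead of going through the indicator list (the function name is just for illustration):

import ahocorasick

def ahocorasick_find_words(search_list, text):
    # Build the automaton exactly as in the code above
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()

    # A.iter() yields (end_index, value) pairs, where value is the
    # (idx, word) tuple stored above, so the words can be collected directly
    return {value[1] for _end, value in A.iter(text)}

found = ahocorasick_find_words(['joão', 'maria', 'thiago'],
                               'asdasdasd joão 1231231 thiago')
print(found)  # {'joão', 'thiago'}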

celsowm

1 Answer


If you are only interested in matching whole words (i.e., tokens separated by whitespace), rather than arbitrary substrings of the text, it might be faster to use a set of words. Note, however, that this uses some additional memory. One straightforward solution to replicate your behaviour would be:

# `text` is the input document; splitting on whitespace yields its words
words = set(text.split())
for w in search_list:
    if w in words:
        print(w)

or, even shorter (but changing the order of the results and dropping duplicates from the search list):

for w in set(search_list).intersection(text.split()):
    print(w)

I've quickly tested this on a relatively large text object (143M characters, 23M words) and a rather short search_list object (606 words, of which 295 are unique), and the times I got are:

  • Aho-Corasick: 14.5 s
  • first version above: 4.6 s
  • second version above: 2.6 s (this speedup comes just from doing half the work by skipping duplicate words)

However, the first version uses a (relatively) negligible amount of additional memory, while the other versions use quite a lot of it (for the data I was using, almost 2 GB of additional memory).
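
For completeness, here is a minimal sketch of how such a comparison could be timed; the sample data below is a small stand-in, not the 143M-character text used above:

import time
import ahocorasick

search_list = ['joão', 'maria', 'thiago'] * 200           # stand-in for the real list
text = 'asdasdasd joão 1231231 thiago ' * 100_000         # stand-in for the huge text

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f'{label}: {time.perf_counter() - start:.2f}s')

def run_ahocorasick():
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()
    return {value[1] for _, value in A.iter(text)}

def run_set_lookup():
    words = set(text.split())
    return [w for w in search_list if w in words]

def run_intersection():
    return set(search_list).intersection(text.split())

timed('Aho-Corasick', run_ahocorasick)
timed('set lookup', run_set_lookup)
timed('intersection', run_intersection)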

Marco Spinaci
  • Wow, I did not know that a simple for + in could be faster than corasick... Could you test whether using a list comprehension affects the performance? – celsowm Jan 05 '22 at 17:48
  • 1
    The main trick here is that the list of words (returned from `split()`) has been converted to a set, and looking if an object is in a set is O(1) independently of how many words are in there. I don't know corasick, but I suspect it checks general substrings, hence why it can be slower. What do you mean by list comprehension? To replace the `for` loop? If so it won't change anything, the list/set the loop above is iterating over is "tiny" (a few hundred words), it doesn't affect runtime at all. Well over 99% of the runtime is taken to apply `split()` and to convert the resulting list to a set. – Marco Spinaci Jan 05 '22 at 20:24
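
To illustrate the point about set membership in that last comment, here is a small, self-contained sketch (the sizes are arbitrary):

import timeit

words_list = [f'word{i}' for i in range(1_000_000)]
words_set = set(words_list)

# Membership in a list scans elements one by one (O(n));
# membership in a set is a hash lookup (O(1) on average).
print(timeit.timeit(lambda: 'word999999' in words_list, number=100))
print(timeit.timeit(lambda: 'word999999' in words_set, number=100))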