3

I have a large string and a list of search strings and want to build a boolean list indicating whether or not each of the search strings exists in the large string. What is the fastest way to do this in Python?

Below is a toy example using a naive approach, but I think it's likely there's a more efficient way of doing this.

e.g. the example below should return [1, 1, 0] since both "hello" and "world" exist in the test string.

def check_strings(search_list, input):
    output = []
    for s in search_list:
        if input.find(s) > -1:
            output.append(1)
        else:
            output.append(0)
    return output

search_strings = ["hello", "world", "goodbye"]
test_string = "hello world"
print(check_strings(search_strings, test_string))

Danny Friar
    The proper solution would be to implement Rabin Karp algorithm for multiple keys at once. – enedil Jun 01 '17 at 14:45
  • Further to @enedil 's comment: https://stackoverflow.com/questions/22216948/python-rabin-karp-algorithm-hashing – Robᵩ Jun 01 '17 at 14:52
    Came across a Python implementation of Aho Corasick whilst researching Rabin Karp which seems to solve it with a single pass through the test string: https://pypi.python.org/pypi/pyahocorasick/ – Danny Friar Jun 01 '17 at 15:30

4 Answers

6

I can't say whether this is the fastest (it is still O(n*m)), but this is the way I would do it:

def check_strings(search_list, input_string):
    return [s in input_string for s in search_list]

The following program might be faster, or not. It uses a regular expression to make a single pass through the input string. Note that you may want to use re.escape(i) in the re.findall() expression, or not, depending upon your needs.

def check_strings_re(search_string, input_string):
    import re
    return [any(l)
            for l in
            zip(*re.findall('|'.join('('+i+')' for i in search_string),
                            input_string))]

Here is a complete test program:

def check_strings(search_list, input_string):
    return [s in input_string for s in search_list]


def check_strings_re(search_string, input_string):
    import re
    return [any(l)
            for l in
            zip(*re.findall('|'.join('('+i+')' for i in search_string),
                            input_string))]


search_strings = ["hello", "world", "goodbye"]
test_string = "hello world"
assert check_strings(search_strings, test_string) == [True, True, False]
assert check_strings_re(search_strings, test_string) == [True, True, False]
Robᵩ
    If the string is **really** large (like, several millions of characters) and the list of words to find is long, I would expect using regular expression might be faster, but certainly less elegant. – Błotosmętek Jun 01 '17 at 14:48
  • Agreed, this is a better implementation. I considered using regex but was unsure how to build the boolean list from a regular expression. It would be easy to compile a regex to check if __any__ of the strings are in the large string but couldn't think of a way to do the above without looping. – Danny Friar Jun 01 '17 at 15:14
  • @DannyFriar now you've got me interested, I'll take a go at it :-) – Błotosmętek Jun 01 '17 at 15:16
4

An implementation using the Aho Corasick algorithm (https://pypi.python.org/pypi/pyahocorasick/), which uses a single pass through the string:

import ahocorasick
import numpy as np

def check_strings(search_list, input):
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()

    index_list = []
    for item in A.iter(input):
        index_list.append(item[1][0])

    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1
    return output_list.tolist()

search_strings = ["hello", "world", "goodbye"]
test_string = "hello world"
print(check_strings(search_strings, test_string))
Danny Friar
  • If it isn't too much trouble, could you post timeit results of all discussed implementations for your "real life" data? – Błotosmętek Jun 01 '17 at 15:56
2

I'm posting this just for comparison. My benchmarking code:

#!/usr/bin/env python3
def gettext():
    from os import scandir
    l = []
    for file in scandir('.'):
        if file.name.endswith('.txt'):
            l.append(open(file.name).read())
    return ' '.join(l)

def getsearchterms():
    return list(set(open('searchterms').read().split(';')))

def rob(search_string, input_string):
    import re
    return [any(l)
            for l in
            zip(*re.findall('|'.join('('+i+')' for i in search_string),
                            input_string))]

def blotosmetek(search_strings, input_string):
    import re
    regexp = re.compile('|'.join([re.escape(x) for x in search_strings]))
    found = set(regexp.findall(input_string))
    return [x in found for x in search_strings]

def ahocorasick(search_list, input):
    import ahocorasick
    import numpy as np
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()

    index_list = []
    for item in A.iter(input):
        index_list.append(item[1][0])

    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1
    return output_list.tolist()

def naive(search_list, text):
    return [s in text for s in search_list]

def test(fn, args):
    start = datetime.now()
    ret = fn(*args)
    end = datetime.now()
    return (end-start).total_seconds()

if __name__ == '__main__':
    from datetime import datetime
    text = gettext()
    print("Got text, total of", len(text), "characters")
    search_strings = getsearchterms()
    print("Got search terms, total of", len(search_strings), "words")

    fns = [ahocorasick, blotosmetek, naive, rob]
    for fn in fns:
        r = test(fn, [search_strings, text])
        print(fn.__name__, r*1000, "ms")
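As an aside (this is a hypothetical harness, not part of the code above): a single wall-clock measurement with datetime is noisy; timeit.repeat runs each candidate many times and is more robust for comparisons of this kind. A minimal sketch:

```python
import timeit

def naive(search_list, text):
    return [s in text for s in search_list]

search_strings = ["hello", "world", "goodbye"]
test_string = "hello world " * 1000

# best of 5 repeats, 100 calls each; the minimum is the measurement
# least distorted by background load
best = min(timeit.repeat(lambda: naive(search_strings, test_string),
                         repeat=5, number=100))
print(f"naive: {best / 100 * 1000:.3f} ms per call")
```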

As search terms I used distinct words that appear in Leviathan, and as the search string the concatenation of the 25 most-downloaded books from Project Gutenberg. The results are as follows:

Got text, total of 18252025 characters
Got search terms, total of 12824 words
ahocorasick 3824.111 ms
blotosmetek 360565.542 ms
naive 73765.67 ms

Rob's version has already been running for about an hour and still hasn't finished. Maybe it's broken, or maybe it's simply painfully slow.

enedil
  • OK, so ahocorasick is faster by two orders of magnitude than regexp, this is something worth remembering. Thanks. – Błotosmętek Jun 02 '17 at 07:58
  • @Błotosmętek I bet the more data, the difference is more noticeable - ahocorasick will be perhaps 10**10 times faster than regexp if given enough data. Notice how naive approach is still faster than regexp. As in an old saying, use regexp and you now have two problems. – enedil Jun 02 '17 at 11:53
  • BTW I believe this is so because the ahocorasick module is written in C. Would the algorithm be implemented in pure Python, I don't think it would be so blazingly fast. – Błotosmętek Jun 02 '17 at 13:03
  • Extension here with multiple strings to search: https://stackoverflow.com/questions/44744895/fastest-way-to-build-string-feature-vectors-python – Danny Friar Jun 25 '17 at 09:24
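Regarding the pure-Python question in the comments: a minimal, unoptimized sketch of the algorithm itself (a dict-based trie plus BFS-computed failure links; the names here are illustrative, not the pyahocorasick API) could look like this:

```python
from collections import deque

def check_strings_aho(patterns, text):
    # build a dict-based trie: goto[state][char] -> next state
    goto = [{}]
    out = [set()]   # out[state]: indices of patterns ending at this state
    fail = [0]
    for idx, p in enumerate(patterns):
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                out.append(set())
                fail.append(0)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(idx)

    # breadth-first pass to compute failure links
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            # inherit matches reachable via the failure link
            out[s] |= out[fail[s]]

    # single pass through the text
    found = set()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return [1 if i in found else 0 for i in range(len(patterns))]
```

A pure-Python version like this will of course be far slower than the C extension, but it shows why a single pass suffices: the failure links let the scan fall back without re-reading characters.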
1

My version using regular expressions:

import re

def check_strings(search_strings, input_string):
    regexp = re.compile('|'.join([re.escape(x) for x in search_strings]))
    found = set(regexp.findall(input_string))
    return [x in found for x in search_strings]

On the test data provided by original poster it is by an order of magnitude slower than Rob's pretty solution, but I'm going to do some benchmarking on a bigger sample.
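One caveat worth noting with the findall()-plus-set approach: re.findall() returns non-overlapping matches and commits to one alternative per position, so a search string whose occurrence overlaps another match can be missed. A small illustration with hypothetical inputs:

```python
import re

def check_strings(search_strings, input_string):
    regexp = re.compile('|'.join(re.escape(x) for x in search_strings))
    found = set(regexp.findall(input_string))
    return [x in found for x in search_strings]

# "hello" is present in the text, but the alternation matches the shorter
# prefix "he" at the same position, so "hello" is reported as absent:
print(check_strings(["he", "hello"], "hello world"))  # [True, False]

# the naive per-string check does not have this problem:
print([s in "hello world" for s in ["he", "hello"]])  # [True, True]
```

This only bites when one search string can overlap a match of another; for disjoint vocabularies the regex version gives the same answers as the naive loop.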

Błotosmętek
  • Hmm, what if you try it with GNU grep? There's a known bug in the Python and Perl regex implementations that makes some queries exponential. – enedil Jun 01 '17 at 17:28
    Hey, if you're interested - look at my answer. – enedil Jun 01 '17 at 19:56