15

The problem: I'm given a large static list of strings `A` and a long string `B`. The strings in `A` are all very short (it's a keyword list). I want to check which strings in `A` are substrings of `B` and collect them.

Now I use a simple loop like:

result = []
for word in A:
    if word in B:
        result.append(word)

But it's crazy slow when `A` contains ~500,000 or more items.

Is there any library or algorithm that fits this problem? I've tried my best to search, but had no luck.

Thank you!

Felix Yan
  • Just a theory - what if you try using `B.find(word)` instead of `if word in B`? I believe `in` is fast if the substring is really in `B`, but it will get much slower if it's not a substring. `find` might be faster. – wkl Jan 13 '12 at 02:46
  • @birryree Thanks for the comment, but in my tests using `B.find(word)` instead of `word in B` did not make any difference in performance :( – Felix Yan Jan 13 '12 at 03:06

5 Answers

16

Your problem is large enough that you probably need to hit it with the algorithm bat.

Take a look at the Aho-Corasick algorithm. Your problem statement is a paraphrase of the problem that this algorithm tackles.

Also, look into the work by Nicholas Lehuen with his PyTST package.

There are also references in a related Stack Overflow question that mention other algorithms, such as Rabin-Karp: Algorithm for linear pattern matching?
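For illustration, here is a minimal sketch of that kind of multi-pattern scan using the pyahocorasick package (my choice for the example; the answer itself points to PyTST):

import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for word in A:
    automaton.add_word(word, word)  # store each keyword as its own payload
automaton.make_automaton()          # build the failure links

# A single pass over B reports every keyword occurrence.
result = list({word for _end, word in automaton.iter(B)})

This scans B once, in time roughly linear in len(B) plus the number of matches, rather than running 500,000 independent substring searches.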

dyoo
  • +1 -- this is the answer. I thought of a [trie](http://en.wikipedia.org/wiki/Trie)-based approach, but this is even better. – senderle Jan 13 '12 at 03:15
  • Thank you so much, I've got it to work perfectly. Here is my test result: `2012-01-13 11:48:07.632212 Importing test cases`; `2012-01-13 11:48:17.191975 Scanning using in`; `2012-01-13 11:48:47.750070 Scan completed`; `2012-01-13 11:48:47.752614 TSTing`; `2012-01-13 11:48:56.780503 Scanning using tst`; `2012-01-13 11:48:56.798033 Scan completed` – Felix Yan Jan 13 '12 at 03:49
3

Depending on how long your long string is, it may be worth doing something like this:

ls = 'my long string of stuff'
# Generate all possible substrings of ls, keeping only uniques
x = {ls[p:y] for p in range(len(ls)) for y in range(p + 1, len(ls) + 1)}

result = []
for word in A:
    if word in x:
        result.append(word)

Obviously if your long string is very, very long then this also becomes too slow, but it should be faster for any string under a few hundred characters.
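If the keywords are short, you can also cap the substring length at the longest keyword (the optimization David Z suggests in the comments below); a rough sketch, assuming `A` is the keyword list from the question:

max_len = max(len(word) for word in A)
# Only substrings no longer than the longest keyword can ever match
x = {ls[p:p + n] for p in range(len(ls)) for n in range(1, max_len + 1)}

result = [word for word in A if word in x]

This shrinks the set from roughly len(ls)**2 entries to about len(ls) * max_len.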

Yuushi
  • the OP pointed out a very important detail here: he's parsing Chinese characters. Therefore `ls = 'mylongstringofstuff'` with no spaces, and a set built from combinations of `ls`'s indexes won't map to words very usefully. – yurisich Jan 13 '12 at 03:04
  • @Droogans I posted this before he added that, but I still don't see the problem. Let `ls = '解析するためにいくつかの文字'` as a random example of (Japanese) characters – the above still (almost) works (there's actually a small bug in what I've written that I'm trying to fix, but I think the idea is still fine). Edit: Bug should be fixed. – Yuushi Jan 13 '12 at 03:11
  • I apologize. If you're compiling your examples using kanji characters with no spaces, and are getting useful results, then I must not be understanding your code (or the problem) as clearly as I should. – yurisich Jan 13 '12 at 03:13
  • I was thinking of something along these lines, but only generating substrings of length up to the longest element of `A`. – David Z Jan 13 '12 at 03:14
  • @DavidZaslavsky Ah yes, that might be a good optimization to make. – Yuushi Jan 13 '12 at 03:16
1

I don't know if this would be any quicker, but it's a lot more pythonic:

result = [x for x in A if x in B]
Josh Smeaton
1

Pack all the individual words of `B` into a new list by splitting the original string on `' '`. Then, for each word in that list, test for membership in `A`. If you find a match, delete it from `A`, and quit as soon as `A` is empty.

As it stands, it seems like your approach has to grind through all 500,000 candidates every time, with no opt-out in place.
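A minimal sketch of this idea, assuming `B` really is space-delimited (the names are illustrative):

remaining = set(A)          # keywords not yet matched
result = []
for token in B.split(' '):
    if token in remaining:  # whole-word hit against the keyword set
        result.append(token)
        remaining.remove(token)
        if not remaining:   # quit as soon as A is exhausted
            break

Note that this only catches keywords that appear as whole space-delimited words, which is why the comments below point out that it breaks down for Chinese text.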

yurisich
  • Sorry I did not make it clear that the strings are in Chinese, so words are not separated by spaces. I will have to do much more work to find out "all the individual words of `B`". – Felix Yan Jan 13 '12 at 02:56
  • @FelixYan my last comment should then be my only useful advice to you; find a way to slim down your list of candidates as you comb through `B`. One less member in the outer `for` loop using `A` will speed up your search times, no matter how you go about doing it. – yurisich Jan 13 '12 at 02:58
1

Assume all of your keywords have the same length (later you could extend this algorithm to handle different lengths).

I would suggest the following:

  1. precalculate a hash for each keyword (for example, an XOR hash):

    hash256 = reduce(int.__xor__, map(ord, keyword))

  2. create a dictionary where the key is a hash and the value is the list of keywords with that hash

  3. save the keyword length, then slide a window of that length over B:

    curr_keyword = []
    for x in B:
        curr_keyword.append(x)
        if len(curr_keyword) > keyword_length:
            curr_keyword = curr_keyword[1:]  # drop the oldest character
        if len(curr_keyword) == keyword_length:
            hash256 = reduce(int.__xor__, map(ord, curr_keyword))
            if hash256 in dictionary_of_hashed:
                # compare the window against each keyword in that hash bucket
    

Something like this
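Putting the steps together, a self-contained sketch (the helper name find_keywords and the bucket layout are illustrative, not from the answer):

from functools import reduce  # reduce is a builtin on Python 2
from collections import defaultdict

def find_keywords(A, B):
    keyword_length = len(A[0])  # assumes every keyword has this length
    # Steps 1 and 2: bucket the keywords by their XOR hash.
    buckets = defaultdict(list)
    for keyword in A:
        buckets[reduce(int.__xor__, map(ord, keyword))].append(keyword)
    # Step 3: slide a window of keyword_length over B.
    found = set()
    for i in range(len(B) - keyword_length + 1):
        window = B[i:i + keyword_length]
        h = reduce(int.__xor__, map(ord, window))
        if h in buckets and window in buckets[h]:  # verify inside the bucket
            found.add(window)
    return list(found)

Because XOR is its own inverse, the window hash could also be updated incrementally (h ^= ord(dropped_char) ^ ord(added_char)) instead of recomputed, which is what makes this a rolling hash in the Rabin-Karp spirit. The XOR hash ignores character order and so collides often, which is why the per-bucket verification is essential.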

Alex