15

The problem: I'm given a large static list of strings `A` and a long string `B`. The strings in `A` are all very short (it's a keyword list). I want to check which strings in `A` are substrings of `B` and collect them.

Now I use a simple loop like:

result = []
for word in A:
    if word in B:
        result.append(word)

But it's crazy slow when `A` contains ~500,000 or more items.

Is there any library or algorithm that fits this problem? I've tried my best to search, but had no luck.

Thank you!

Felix Yan
  • Just a theory - what if you try using `B.find(word)` instead of `if word in B`? I believe `in` is fast if the substring is really in `B`, but it will get much slower if it's not a substring. `find` might be faster. – wkl Jan 13 '12 at 02:46
  • @birryree Thanks for the comment, but in my tests using `B.find(word)` instead of `word in B` did not make any difference in performance :( – Felix Yan Jan 13 '12 at 03:06

5 Answers

16

Your problem is large enough that you probably need to hit it with the algorithm bat.

Take a look at the Aho-Corasick algorithm. Your problem statement is a paraphrase of the problem that this algorithm tackles.

Also, look into the work by Nicholas Lehuen with his PyTST package.

There are also references in a related Stack Overflow question that mention other algorithms, such as Rabin-Karp: Algorithm for linear pattern matching?
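For illustration, here is a minimal sketch of that kind of multi-pattern scan using the pyahocorasick package (my choice for the example; the answer itself points to PyTST):

import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for word in A:
    automaton.add_word(word, word)  # store each keyword as its own payload
automaton.make_automaton()          # build the failure links

# A single pass over B reports every keyword occurrence.
result = list({word for _end, word in automaton.iter(B)})

This scans B once, in time roughly linear in len(B) plus the number of matches, rather than running 500,000 independent substring searches.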

dyoo
  • +1 -- this is the answer. I thought of a [trie](http://en.wikipedia.org/wiki/Trie)-based approach, but this is even better. – senderle Jan 13 '12 at 03:15
  • Thank you so much, I've got it to work perfectly. Here is my test result: `2012-01-13 11:48:07.632212 Importing test cases`; `2012-01-13 11:48:17.191975 Scanning using in`; `2012-01-13 11:48:47.750070 Scan completed`; `2012-01-13 11:48:47.752614 TSTing`; `2012-01-13 11:48:56.780503 Scanning using tst`; `2012-01-13 11:48:56.798033 Scan completed` – Felix Yan Jan 13 '12 at 03:49
3

Depending on how long your long string is, it may be worth doing something like this:

ls = 'my long string of stuff'
# Generate all possible substrings of ls, keeping only uniques
x = {ls[p:y] for p in range(len(ls)) for y in range(p + 1, len(ls) + 1)}

result = []
for word in A:
    if word in x:
        result.append(word)

Obviously if your long string is very, very long then this also becomes too slow, but it should be faster for any string under a few hundred characters.
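If the keywords are short, you can also cap the substring length at the longest keyword (the optimization David Z suggests in the comments below); a rough sketch, assuming `A` is the keyword list from the question:

max_len = max(len(word) for word in A)
# Only substrings no longer than the longest keyword can ever match
x = {ls[p:p + n] for p in range(len(ls)) for n in range(1, max_len + 1)}

result = [word for word in A if word in x]

This shrinks the set from roughly len(ls)**2 entries to about len(ls) * max_len.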

Yuushi
  • the OP pointed out a very important detail here: he's parsing Chinese characters. Therefore `ls = 'mylongstringofstuff'` with no spaces, and a set built from combinations of `ls`'s indexes won't map to words very usefully. – yurisich Jan 13 '12 at 03:04
  • @Droogans I posted this before he added that, but I still don't see the problem. Let `ls = '解析するためにいくつかの文字'` as a random example of (Japanese) characters – the above still (almost) works (there's actually a small bug in what I've written that I'm trying to fix, but I think the idea is still fine). Edit: Bug should be fixed. – Yuushi Jan 13 '12 at 03:11
  • I apologize. If you're compiling your examples using kanji characters with no spaces, and are getting useful results, then I must not be understanding your code (or the problem) as clearly as I should. – yurisich Jan 13 '12 at 03:13
  • I was thinking of something along these lines, but only generating substrings of length up to the longest element of `A`. – David Z Jan 13 '12 at 03:14
  • @DavidZaslavsky Ah yes, that might be a good optimization to make. – Yuushi Jan 13 '12 at 03:16
1

I don't know if this would be any quicker, but it's a lot more pythonic:

result = [x for x in A if x in B]
Josh Smeaton
1

Pack all the individual words of `B` into a new list by splitting the original string on `' '`. Then, for each word in that list, test for membership in `A`. If you find a match, delete it from `A`, and quit as soon as `A` is empty.

As it stands, it seems like your approach has to grind through all 500,000 candidates every time, with no opt-out in place.
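A minimal sketch of this idea, assuming `B` really is space-delimited (the names are illustrative):

remaining = set(A)          # keywords not yet matched
result = []
for token in B.split(' '):
    if token in remaining:  # whole-word hit against the keyword set
        result.append(token)
        remaining.remove(token)
        if not remaining:   # quit as soon as A is exhausted
            break

Note that this only catches keywords that appear as whole space-delimited words, which is why the comments below point out that it breaks down for Chinese text.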

yurisich
  • Sorry I did not make it clear that the strings are in Chinese, so words are not separated by spaces. I will have to do much more work to find out "all the individual words of `B`". – Felix Yan Jan 13 '12 at 02:56
  • @FelixYan my last comment should then be my only useful advice to you; find a way to slim down your list of candidates as you comb through `B`. One less member in the outer `for` loop using `A` will speed up your search times, no matter how you go about doing it. – yurisich Jan 13 '12 at 02:58
1

Assume all of your keywords have the same length (later you could extend this algorithm to handle different lengths).

I would suggest the following:

  1. precalculate a hash for each keyword (for example, an XOR hash):

    hash256 = reduce(int.__xor__, map(ord, keyword))

  2. create a dictionary where the key is a hash and the value is the list of keywords with that hash

  3. save the keyword length, then slide a window of that length over B:

    curr_keyword = []
    for x in B:
        curr_keyword.append(x)
        if len(curr_keyword) > keyword_length:
            curr_keyword = curr_keyword[1:]  # drop the oldest character
        if len(curr_keyword) == keyword_length:
            hash256 = reduce(int.__xor__, map(ord, curr_keyword))
            if hash256 in dictionary_of_hashed:
                # compare the window against each keyword in that hash bucket
    

Something like this
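Putting the steps together, a self-contained sketch (the helper name find_keywords and the bucket layout are illustrative, not from the answer):

from functools import reduce  # reduce is a builtin on Python 2
from collections import defaultdict

def find_keywords(A, B):
    keyword_length = len(A[0])  # assumes every keyword has this length
    # Steps 1 and 2: bucket the keywords by their XOR hash.
    buckets = defaultdict(list)
    for keyword in A:
        buckets[reduce(int.__xor__, map(ord, keyword))].append(keyword)
    # Step 3: slide a window of keyword_length over B.
    found = set()
    for i in range(len(B) - keyword_length + 1):
        window = B[i:i + keyword_length]
        h = reduce(int.__xor__, map(ord, window))
        if h in buckets and window in buckets[h]:  # verify inside the bucket
            found.add(window)
    return list(found)

Because XOR is its own inverse, the window hash could also be updated incrementally (h ^= ord(dropped_char) ^ ord(added_char)) instead of recomputed, which is what makes this a rolling hash in the Rabin-Karp spirit. The XOR hash ignores character order and so collides often, which is why the per-bucket verification is essential.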

Alex