
The Problem: A large static list of strings is provided, along with a pattern string made up of literal characters and wildcard elements (* and ?). The idea is to return all the strings that match the pattern - simple enough.

Current Solution: I'm currently using a linear approach of scanning the large list and globbing each entry against the pattern.
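
In other words, something along the lines of this sketch (Python and fnmatch are just stand-ins for the real code):

import fnmatch

def linear_search(words, pattern):
    # Baseline: glob the pattern against every entry -- O(n) in the list size.
    # (fnmatch also understands [...] classes, but for patterns of only * and ?
    # it behaves as plain globbing.)
    return [w for w in words if fnmatch.fnmatchcase(w, pattern)]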

My Question: Are there any suitable data structures that I can store the large list into such that the search's complexity is less than O(n)?

Perhaps something akin to a suffix-trie? I've also considered using bi- and tri-grams in a hashtable, but the logic required to evaluate a match based on a merge of the list of words returned and the pattern is a nightmare; furthermore, I'm not convinced it's the correct approach.

  • Are the strings composed of words, and are the patterns word-based? If so, there is a host of information-retrieval techniques that you can use to speed up the search -- if you pay for the O(N) cost of initially indexing it. The best part is that there are plenty of libraries for that. – tucuxi Mar 29 '10 at 23:46
  • Can the *,? elements take parentheses, as in wild(card)? – tucuxi Mar 29 '10 at 23:54

6 Answers


You could build a regular trie and add wildcard edges. Then your lookup complexity would be O(n), where n is the length of the pattern. You would have to replace runs of ** with * in the pattern first (also an O(n) operation).

If the list of words were "I am an ox", then the trie would look a bit like this:

  (I ($ [I])
   a (m ($ [am])
      n ($ [an])
      ? ($ [am an])
      * ($ [am an]))
   o (x ($ [ox])
      ? ($ [ox])
      * ($ [ox]))
   ? ($ [I]
      m ($ [am])
      n ($ [an])
      x ($ [ox])
      ? ($ [am an ox])
      * ($ [I am an ox]
         m ($ [am]) ...)
   * ($ [I am an ox]
      I ...
    ...

And here is a sample Python program:

import sys

def addWord(root, word):
    # Insert a word into the trie, including the '?' and '*' wildcard edges.
    add(root, word, word, '')

def add(root, word, tail, prev):
    if tail == '':
        addLeaf(root, word)
    else:
        head = tail[0]
        tail2 = tail[1:]
        # Follow both the literal edge and a '?' edge for this character.
        add(addEdge(root, head), word, tail2, head)
        add(addEdge(root, '?'), word, tail2, head)
    if prev != '*':
        # A '*' edge may consume any number of the remaining characters.
        for l in range(len(tail) + 1):
            add(addEdge(root, '*'), word, tail[l:], '*')

def addEdge(root, char):
    if char not in root:
        root[char] = {}
    return root[char]

def addLeaf(root, word):
    # '$' marks end-of-word; it stores the list of matching words.
    if '$' not in root:
        root['$'] = []
    leaf = root['$']
    if word not in leaf:
        leaf.append(word)

def findWord(root, pattern):
    prev = ''
    for p in pattern:
        # Collapse runs of '*' in the pattern.
        if p == '*' and prev == '*':
            continue
        prev = p
        if p not in root:
            return []
        root = root[p]
    if '$' not in root:
        return []
    return root['$']

def run():
    print("Enter words, one per line; terminate with a . on a line by itself")
    root = {}
    while True:
        line = sys.stdin.readline()[:-1]
        if line == '.':
            break
        addWord(root, line)
    print(repr(root))
    print("Now enter search patterns (runs of '*' are collapsed automatically)")
    while True:
        line = sys.stdin.readline()[:-1]
        if line == '.':
            break
        print(findWord(root, line))

run()
David Leonard

I agree that a suffix trie is a good idea to try, except that the sheer size of your dataset might make its construction use up just as much time as its usage would save. They're best if you've got to query them multiple times to amortize the construction cost. Perhaps a few hundred queries.

Also note that this is a good excuse for parallelism. Cut the list in two, hand each half to a different processor, and have your job done twice as fast.
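
For example, a minimal sketch of that split, assuming Python and plain fnmatch globbing for the per-entry match:

import fnmatch
from concurrent.futures import ProcessPoolExecutor

def glob_chunk(args):
    chunk, pattern = args
    return fnmatch.filter(chunk, pattern)

def parallel_glob(words, pattern, workers=2):
    # Split the list into `workers` slices and glob each slice in its own process.
    size = (len(words) + workers - 1) // workers
    chunks = [(words[i:i + size], pattern) for i in range(0, len(words), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(glob_chunk, chunks)
    return [w for part in parts for w in part]

if __name__ == '__main__':
    print(parallel_glob(['hello', 'world', 'help', 'word'], 'w*d'))  # ['world', 'word']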

Karl
  • Unfortunately you can't have O(1) and wildcards. Maybe if the problem were rephrased to work on the word level instead of character-wise you could cut the search space. – Karl Mar 28 '10 at 08:39

There are plenty of good algorithms for multi-string search. Google "Navarro string search" and you'll see a good analysis of the multi-string options. A number of algorithms are extremely good for "normal" cases: Wu-Manber for search strings that are fairly long, and parallel Horspool for search strings whose characters are modestly rare in the text to be searched. Aho-Corasick is an algorithm that guarantees a (tiny) bounded amount of work per input character, no matter how the input text is tuned to create worst-case behaviour in the search. For programs like Snort, that's really important in the face of denial-of-service attacks. If you are interested in how a really efficient Aho-Corasick search can be implemented, take a look at ACISM - an Aho-Corasick Interleaved State Matrix.
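
For a flavour of the idea (a toy dict-based version, not the interleaved state matrix that ACISM uses), here is a minimal Aho-Corasick sketch in Python that reports every occurrence of every literal pattern in a single left-to-right pass:

from collections import deque

def build(patterns):
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure links
    out = [[]]    # patterns recognised on reaching each state

    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Breadth-first pass to fill in the failure links.
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]
    return goto, fail, out

def matches(text, goto, fail, out):
    # Bounded work per input character: follow at most a few failure links.
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            yield i - len(pat) + 1, pat

goto, fail, out = build(['he', 'she', 'his', 'hers'])
print(list(matches('ushers', goto, fail, out)))   # [(1, 'she'), (2, 'he'), (2, 'hers')]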

Mischa

If you don't care about memory and you can afford to pre-process the list, create a sorted array of every suffix, pointing to the original word, e.g., for ['hello', 'world'], store this:

[('d'    , 'world'),
 ('ello' , 'hello'),
 ('hello', 'hello'),
 ('ld'   , 'world'),
 ('llo'  , 'hello'),
 ('lo'   , 'hello'),
 ('o'    , 'hello'),
 ('orld' , 'world'),
 ('rld'  , 'world'),
 ('world', 'world')]

Use this array to build sets of candidate matches using pieces of the pattern.

For instance, if the pattern is *or*, find the candidate match ('orld', 'world') using a binary chop on the substring 'or', then confirm the match with a normal globbing approach.

If the wildcard pattern is more complex, e.g., h*o, build sets of candidates for h and o and find their intersection before the final linear glob.
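
A minimal sketch of the whole scheme, assuming Python with bisect for the binary chop and fnmatch for the final glob (the helper names are made up for illustration):

import bisect
import fnmatch

def build_suffix_index(words):
    # Every suffix of every word, sorted so a prefix range can be binary-chopped.
    entries = sorted((word[i:], word) for word in words for i in range(len(word)))
    keys = [suffix for suffix, _ in entries]
    return keys, entries

def candidates(index, fragment):
    # All words containing `fragment`: locate the block of suffixes starting with it.
    keys, entries = index
    lo = bisect.bisect_left(keys, fragment)
    hi = bisect.bisect_right(keys, fragment + '\uffff')   # assumes no U+FFFF+ characters
    return {word for _, word in entries[lo:hi]}

def glob_search(index, words, pattern):
    # Intersect the candidate sets for each literal fragment, then confirm with a glob.
    fragments = [f for f in pattern.replace('?', '*').split('*') if f]
    pools = [candidates(index, f) for f in fragments]
    pool = set.intersection(*pools) if pools else set(words)
    return sorted(w for w in pool if fnmatch.fnmatchcase(w, pattern))

words = ['hello', 'world']
index = build_suffix_index(words)
print(glob_search(index, words, '*or*'))   # ['world']
print(glob_search(index, words, 'h*o'))    # ['hello']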

Marcelo Cantos

You say you're currently doing linear search. Does this give you any data on the most frequently performed query patterns? E.g. is blah* much more common than bl?h (which I'd assume it is) among your current users?

With that kind of prior knowledge you can focus your indexing efforts on the commonly used cases and get them down to O(1), rather than trying to solve the much more difficult, and yet much less worthwhile, problem of making every possible query equally fast.
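
As a purely hypothetical illustration (the three-character bucket and the "literal prefix followed by *" query shape are assumptions, not something from the question): index the common case and fall back to the linear scan for everything else.

import fnmatch
from collections import defaultdict

PREFIX_LEN = 3   # assumed bucket size; tune to the real query mix

def build_prefix_index(words):
    index = defaultdict(list)
    for w in words:
        index[w[:PREFIX_LEN]].append(w)
    return index

def search(words, index, pattern):
    head, star, _ = pattern.partition('*')
    if star and len(head) >= PREFIX_LEN and '?' not in head:
        pool = index.get(head[:PREFIX_LEN], [])   # the common blah* case: one bucket
    else:
        pool = words                              # anything else: plain linear scan
    return [w for w in pool if fnmatch.fnmatchcase(w, pattern)]

words = ['blahblah', 'blast', 'bloom']
index = build_prefix_index(words)
print(search(words, index, 'blah*'))   # fast path -> ['blahblah']
print(search(words, index, 'bl?h*'))   # fallback  -> ['blahblah']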

Daniel Earwicker
  • @Daniel: I've tried performing statistics on the types of patterns being queried; there aren't any obvious winners. One thing that I did not mention in the original question is that the strings in the static list have a maximum size, and that the average is roughly half the maximum with a stdev of about 1/4 of the average. Not sure if that provides any insight into the problem. –  Mar 28 '10 at 10:00
  • So you wouldn't even say that using one wildcard is a lot more common than using five wildcards? – Daniel Earwicker Mar 28 '10 at 10:11

You can achieve a simple speedup by keeping counts of the characters in your strings. A string with no b's, or only a single b, can never match the query abba*, so there is no point in testing it. This works much better on whole words, if your strings are made of those, since there are many more words than characters; plus, there are plenty of libraries that can build the indexes for you. On the other hand, it is very similar to the n-gram approach you mentioned.

If you do not use a library that does it for you, you can optimize queries by looking up the most globally infrequent characters (or words, or n-grams) first in your indexes. This allows you to discard more non-matching strings up front.

In general, all speedups will be based on the idea of discarding things that cannot possibly match. What and how much to index depends on your data. For example, if the typical pattern length is near to the string length, you can simply check to see if the string is long enough to hold the pattern.
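
A rough sketch of both ideas, assuming Python (it checks character presence only, not counts, and the helper names are made up): build an inverted index from character to the words containing it, intersect starting from the globally rarest literal in the pattern, and glob only the survivors.

import fnmatch
from collections import defaultdict

def build_char_index(words):
    index = defaultdict(set)
    for w in words:
        for ch in set(w):
            index[ch].add(w)
    return index

def search(words, index, pattern):
    literals = set(pattern) - {'*', '?'}
    if not literals:
        pool = words                      # pattern is all wildcards
    else:
        # Probe the rarest literal first so the pool shrinks as fast as possible.
        rarest_first = sorted(literals, key=lambda ch: len(index[ch]))
        pool = set(index[rarest_first[0]])
        for ch in rarest_first[1:]:
            pool &= index[ch]
    return [w for w in pool if fnmatch.fnmatchcase(w, pattern)]

words = ['abba-gold', 'hello', 'world']
index = build_char_index(words)
print(search(words, index, 'abba*'))   # ['abba-gold']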

tucuxi