How to efficiently search large dataset for substrings?

Question

I have a large set of short strings. What are some algorithms and indexing strategies for filtering the list on items that contain a substring? For example, suppose I have a list:

val words = List(
  "pick",
  "prepick",
  "picks",
  "picking",
  "kingly"
  ...
)

How could I find strings that contain the substring "king"? I could brute force the problem like so:

words.filter(_.indexOf("king") != -1) // yields List("picking", "kingly")

This is only practical for small sets; Today I need to support 10 million strings, with a future goal in the billions. Obviously I need to build an index. What kind of index?

I have looked at using an ngram index stored in MySQL, but I am not sure if this is the best approach. I'm not sure how to optimally query the index when the search string is longer than the ngram size.

I have also considered using Lucene, but this is optimized around token matching, not substring matching, and does not seem to support the requirement of simple substring matching. Lucene does have a few classes related to ngrams (org.apache.lucene.analysis.ngram.NGramTokenFilter is one example), but these seem to be intended for spell check and autocomplete use cases, not substring matching, and the documentation is thin.

What other algorithms and indexing strategies should I consider? Are there any open source libraries that support this? Can the SQL or Lucene strategies (above) be made to work?

Another way to illustrate the requirement is with SQL:

SELECT word FROM words WHERE word LIKE CONCAT('%', ?, '%');

Where ? is a user provided search string, and the result is a list of words that contain the search string.

@nhahtdh could you provide more detail? 1 suffix tree per string? wouldn't that require all 10 million trees to be interrogated for a single lookup? — Landon Kuhn, Aug 02 '12 at 19:36
Possible duplicate of http://stackoverflow.com/questions/8756158/partial-match-on-a-dictionary and http://stackoverflow.com/questions/7258125/tagging-of-names-using-lucene-java — Sap, Aug 05 '12 at 19:13

score 2 · Answer 1 · answered Aug 02 '12 at 19:36

2

How big is the longest word? if that's about 7-8 char you may find all substrings for each and every string and insert that substrings in trie (the one is used in Aho-Corasik - http://en.wikipedia.org/wiki/Aho-Corasick) It will take some time to build the tree but then searching for all occurances will be O(length(searched word)).

answered Aug 02 '12 at 19:36

Roman Dzhabarov

521
4
10

Your suggestion is to build a trie containing every substring, with each node containing a list of every matched word? – Landon Kuhn Aug 02 '12 at 20:54
As a result it will be, since separate letter is also substring. Yeah, will be too much memory consumption. – Roman Dzhabarov Aug 02 '12 at 21:06
Will the word we are examine from the initial dictionary as well? – Roman Dzhabarov Aug 02 '12 at 21:10
No, the input search string is not necessarily in the dictionary. I'm not sure the high memory consumption point is a deal breaker. I wonder if it would be possible to model the trie, and the query, in SQL. That sounds like a long shot. – Landon Kuhn Aug 02 '12 at 21:23
I doubt that there is an index which will serve queries like %something%. But suppose queries like something% should be served fast if we have regular B-tree index. So, from each word from dictionary we may produce few other (pick -> pick, ick, ck, k) and put in separate table (id, stringvalue, initial_string_id) and then do select based on like word%. I guess it should be faster than doing regular %word%. – Roman Dzhabarov Aug 02 '12 at 22:12

Arvind · Answer 2 · 2012-08-08T02:56:19.290

Postgres has a module which does a trigram index

That seems an interesting idea too- building a trigram index.

About a comment in your question regarding how to break down text searches greater than n-gram length:

Here's one approach which will work:

Say we have a search string as "abcde" , and we have built a trigram index. (You have strings which are of smaller lengths-this could hit a sweet spot for you) Let the search results of abc= S1, bcd=S2,cde=S3 (where S1,S2,S3 are sets of indexes )

Then the longest common substring of S1,S2,S3 will give the indexes that we want.

We can transform each set of indexes,as a single string separated by a delimiter (say space) before doing LCS.

After we find the LCS,we would have to search the indexes for the complete pattern,since we have broken down the search term. ie we would have to prune results which have "abc-XYZ- bcd-HJI-def"

The LCS of a set of strings can be efficiently found Suffix Arrays. or Suffix trees

@landon9720 : Please do comment when you get a chance to view my answer. I would like to know your opinion about the approach I suggested. — Arvind, Aug 08 '12 at 02:54

How to efficiently search large dataset for substrings?

2 Answers2