Longest word match in an array of strings

Question

Assume a large set of arrays of individual words (not phrases), e.g.

{"One", "two", "three", "four"}
{"One", "two", "three"}
{"One", "two", "where", "are", "you"}
{"One", "other"}
{"Two", "three", "four"}
{"More", "more", "more"}

Given another array of individual words, what would be the most efficient (fastest) way of finding the longest common match, left to right, other than the "brute force" solution (i.e., continuous string matching)?

For example, given the array {"One", "two", "three", "four", "five"} the longest common match in the above list would be {"One", "two", "three", "four"}.

Are gaps allowed? What would `{"One", "two", "three", "hello", "four", "five"}` match? — Sergey Kalinichenko, Dec 03 '14 at 05:04
@dasblinkenlight The words are not collated (as they would be in a phrase, separated e.g. by spaces); they are individual tokens. — PNS, Dec 03 '14 at 05:06
@PNS So `{"One", "two", "three", "four"}` would be matched for my example, right? — Sergey Kalinichenko, Dec 03 '14 at 05:07
@dasblinkenlight Your example would match {"One", "two", "three"} only, since the word "hello" that follows does not continue any array. The number of distinct words would be relatively small (up to, say, 1,000). — PNS, Dec 03 '14 at 05:10
It makes me think of a suffix tree or Aho-Corasick. You can map each string to a character, so that each array becomes a string which can be fed into either of the two algorithms above. I am not too sure about your requirement, though. — nhahtdh, Dec 03 '14 at 06:27
Good ideas. A custom trie may be a solution, too. The words cannot be joined into a string, because they come from a tokenization process, with separators that vary. — PNS, Dec 03 '14 at 07:09
@PNS: The idea is to map each string in an array to a single character, then concatenate all characters into a string that represents the whole array. — nhahtdh, Dec 03 '14 at 09:41
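
The trie idea from the comments can be sketched as follows. This is only an illustration, not a posted answer. It assumes that a match must be a complete stored array that is a prefix of the query (as in the examples above) and that words are compared exactly and case-sensitively. The names `TrieNode`, `build_trie`, and `longest_match` are hypothetical. Each edge of the trie is a whole word token, so the words never need to be joined into a string.

```python
class TrieNode:
    """One node of a trie whose edges are whole words (tokens)."""

    def __init__(self):
        self.children = {}   # word -> TrieNode
        self.is_end = False  # True if a stored array ends at this node


def build_trie(arrays):
    """Insert every word array into a trie, one word per edge."""
    root = TrieNode()
    for words in arrays:
        node = root
        for word in words:
            node = node.children.setdefault(word, TrieNode())
        node.is_end = True
    return root


def longest_match(root, query):
    """Return the longest stored array that is a prefix of `query`.

    Assumption: only complete stored arrays count as matches, and
    comparison is exact (case-sensitive).
    """
    node = root
    best = 0  # length of the longest complete match seen so far
    for i, word in enumerate(query, start=1):
        node = node.children.get(word)
        if node is None:
            break  # no stored array continues with this word
        if node.is_end:
            best = i
    return list(query[:best])


arrays = [
    ["One", "two", "three", "four"],
    ["One", "two", "three"],
    ["One", "two", "where", "are", "you"],
    ["One", "other"],
    ["Two", "three", "four"],
    ["More", "more", "more"],
]
root = build_trie(arrays)
print(longest_match(root, ["One", "two", "three", "four", "five"]))
print(longest_match(root, ["One", "two", "three", "hello", "four", "five"]))
```

With dictionary lookups at each node, a query costs time proportional to the number of its words that are matched, regardless of how many arrays are stored.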
