Assume a large set of arrays of individual words (not phrases), e.g.

{"One", "two", "three", "four"}
{"One", "two", "three"}
{"One", "two", "where", "are", "you"}
{"One", "other"}
{"Two", "three", "four"}
{"More", "more", "more"}

Given another array of individual words, what would be the most efficient (fastest) way of finding the longest common match, left to right, other than the "brute force" solution (i.e., continuous string matching)?

For example, given the array {"One", "two", "three", "four", "five"} the longest common match in the above list would be {"One", "two", "three", "four"}.

PNS
  • Are gaps allowed? What would `{"One", "two", "three", "hello", "four", "five"}` match? – Sergey Kalinichenko Dec 03 '14 at 05:04
  • @dasblinkenlight The words are not collated (as they would be in a phrase, separated e.g. by spaces); they are individual tokens. – PNS Dec 03 '14 at 05:06
  • @PNS So `{"One", "two", "three", "four"}` would be matched for my example, right? – Sergey Kalinichenko Dec 03 '14 at 05:07
  • @dasblinkenlight Your example would match {"One", "two", "three"} only, since the "hello" word following after that does not match any array. The number of distinct words would be relatively small (up to, say, 1,000). – PNS Dec 03 '14 at 05:10
  • It makes me think of a suffix tree or Aho-Corasick. You can map each string to a character, so each array becomes a string that can be fed into either of those two algorithms. I am not too sure about your requirement, though. – nhahtdh Dec 03 '14 at 06:27
  • Good ideas. A custom trie may be a solution, too. The words cannot be joined into a string, because they come from a tokenization process, with separators that vary. – PNS Dec 03 '14 at 07:09
  • @PNS: The idea is to map each string in an array to a single character, then concatenate all characters into a string that represents the whole array. – nhahtdh Dec 03 '14 at 09:41
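
The trie idea raised in the comments can be sketched without joining the words into a string at all: each edge of the trie is a whole word, so the tokenizer's separators never matter. The following is only a minimal sketch under some assumptions, not a definitive answer: the names `WordTrie`, `insert`, and `longest_match` are hypothetical, matching is assumed to be case-sensitive, and a "match" is assumed to mean the longest stored array that is a prefix of the given array (as in the "hello" example above).

```python
class WordTrie:
    """Trie whose edges are whole words rather than characters (illustrative)."""

    def __init__(self):
        self.children = {}      # maps a word to the child WordTrie node
        self.terminal = False   # True if a stored array ends at this node

    def insert(self, words):
        """Store one array of words in the trie."""
        node = self
        for word in words:
            node = node.children.setdefault(word, WordTrie())
        node.terminal = True

    def longest_match(self, query):
        """Return the longest stored array that is a prefix of `query`."""
        node = self
        best = 0  # length of the longest stored array seen so far
        for depth, word in enumerate(query, start=1):
            node = node.children.get(word)
            if node is None:
                break  # no stored array continues with this word
            if node.terminal:
                best = depth
        return list(query[:best])


# Build the trie from the arrays in the question.
trie = WordTrie()
for array in [
    ["One", "two", "three", "four"],
    ["One", "two", "three"],
    ["One", "two", "where", "are", "you"],
    ["One", "other"],
    ["Two", "three", "four"],
    ["More", "more", "more"],
]:
    trie.insert(array)

print(trie.longest_match(["One", "two", "three", "four", "five"]))
print(trie.longest_match(["One", "two", "three", "hello", "four", "five"]))
```

With this layout a lookup costs one dictionary access per word of the given array, regardless of how many arrays are stored. Since the number of distinct words is small, the words could also be replaced by integer ids first, as suggested in the last comment, without changing the structure.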

0 Answers