2

I want to find most longest common suffix in a set of strings to detect some potential important morpheme in my natural language process project.

Given frequency K>=2,find the K-most common longest suffix in a list of strings S1,S2,S3...SN

To simplify the problem, here are some examples:

Input1:

 K=2
 S=["fireman","woman","businessman","policeman","businesswoman"]

Output1:

 ["man","eman","woman"]

Explanation1:

"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times


It would be appreciated if the output also tracked the frequency of each common suffix

{"man":4,"eman":2,"woman":2}

It is not guaranteed that each word will have at least one-length common suffix, see the example below.

Input2:

 K=2
 S=["fireman","woman","businessman","policeman","businesswoman","apple","pineapple","people"]

Output2:

 ["man","eman","woman","ple","apple"]

Explanation2:

"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times

"ple" occur 3 times, "apple" occur 2 times


Any efficient algorithm to solve this problem ?

I have referred to Suffix Tree and Generalised Suffix Tree algorithm but it seems not to fit this problem well.

(BTW, I am researching on a Chinese NLP project which means there are far more characters than English has)

martineau
  • 119,623
  • 25
  • 170
  • 301
ken wang
  • 165
  • 1
  • 12
  • @WillemVanOnsem I have tried this [python-suffix-tree library](https://github.com/ptrus/suffix-trees), but have no idea to change its source to fit my problem – ken wang Dec 21 '17 at 15:30
  • what about Trie? https://github.com/google/pygtrie – DanielM Dec 21 '17 at 15:44
  • @style suffix tree is derived from Trie tree. standard Trie seems to deal with prefix match problems? – ken wang Dec 21 '17 at 15:50
  • If it works for prefix, couldn't you do the following: Reverse your strings, find the prefixes, then reverse the matches? – pault Dec 21 '17 at 15:52
  • 1
    reverse all strings, put them in a trie, and then the longest path from root without split is the longest suffix – DanielM Dec 21 '17 at 15:53

2 Answers2

1

If you want to be really efficient this paper could help. The algorithm gives you the most frequent (longest) substrings from a set of strings in O(n log n). The basic idea is the following:

  1. Concatenate all strings with whitespace between them (these are used as seperators later) in order to from a single long string

  2. Build a suffix array SA over the long string

  3. For each pair of adjacent entries in the suffix array, check how many of the starting characters are equal of the two suffixes that the entries refer to (you should stop when you reach a whitespace). Put the count of equal characters in an additional array (called LCP). Each entry LCP[i] refers to the count of equal characters of the string at position SA[i-1] and SA[i] in the long string.

  4. Go over the LCP an find successive entries with a count larger than K.

Example from the paper for3 strings

s1= bbabab

s2 = abacac

s3 = bbaaa

enter image description here

SaiBot
  • 3,595
  • 1
  • 13
  • 19
0

One idea would be to reverse the strings and then run a trie on them. Such a simple trick in the book. Once subsequence is reached, just a matter of reversal again.

E.g. Result from Tries would be

[nam, name, ...]

Reversal of these would give you the final answer.

[man, eman, ...]

Time complexity increased by + O(n).

Red
  • 26,798
  • 7
  • 36
  • 58