Find K-most longest common suffix in a set of strings

Question

I want to find most longest common suffix in a set of strings to detect some potential important morpheme in my natural language process project.

Given frequency K>=2,find the K-most common longest suffix in a list of strings S1,S2,S3...SN

To simplify the problem, here are some examples:

Input1:

 K=2
 S=["fireman","woman","businessman","policeman","businesswoman"]

Output1:

 ["man","eman","woman"]

Explanation1:

"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times

It would be appreciated if the output also tracked the frequency of each common suffix

{"man":4,"eman":2,"woman":2}

It is not guaranteed that each word will have at least one-length common suffix, see the example below.

Input2:

 K=2
 S=["fireman","woman","businessman","policeman","businesswoman","apple","pineapple","people"]

Output2:

 ["man","eman","woman","ple","apple"]

Explanation2:

"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times

"ple" occur 3 times, "apple" occur 2 times

Any efficient algorithm to solve this problem ？

I have referred to Suffix Tree and Generalised Suffix Tree algorithm but it seems not to fit this problem well.

(BTW, I am researching on a Chinese NLP project which means there are far more characters than English has)

@WillemVanOnsem I have tried this [python-suffix-tree library](https://github.com/ptrus/suffix-trees), but have no idea to change its source to fit my problem — ken wang, Dec 21 '17 at 15:30
@style suffix tree is derived from Trie tree. standard Trie seems to deal with prefix match problems? — ken wang, Dec 21 '17 at 15:50
If it works for prefix, couldn't you do the following: Reverse your strings, find the prefixes, then reverse the matches? — pault, Dec 21 '17 at 15:52
reverse all strings, put them in a trie, and then the longest path from root without split is the longest suffix — DanielM, Dec 21 '17 at 15:53

SaiBot · Answer 1 · 2017-12-21T16:06:41.300

If you want to be really efficient this paper could help. The algorithm gives you the most frequent (longest) substrings from a set of strings in O(n log n). The basic idea is the following:

Concatenate all strings with whitespace between them (these are used as seperators later) in order to from a single long string
Build a suffix array SA over the long string
For each pair of adjacent entries in the suffix array, check how many of the starting characters are equal of the two suffixes that the entries refer to (you should stop when you reach a whitespace). Put the count of equal characters in an additional array (called LCP). Each entry LCP[i] refers to the count of equal characters of the string at position SA[i-1] and SA[i] in the long string.
Go over the LCP an find successive entries with a count larger than K.

Example from the paper for3 strings

s1= bbabab

s2 = abacac

s3 = bbaaa

score 0 · Answer 2 · edited Jul 30 '21 at 12:41

0

One idea would be to reverse the strings and then run a trie on them. Such a simple trick in the book. Once subsequence is reached, just a matter of reversal again.

E.g. Result from Tries would be

[nam, name, ...]

Reversal of these would give you the final answer.

[man, eman, ...]

Time complexity increased by + O(n).

edited Jul 30 '21 at 12:41

Red

26,798
7
36
58

answered Jul 29 '21 at 19:37

Kamal Raghav

31
3

Find K-most longest common suffix in a set of strings

2 Answers2

Linked