Let's say I have a really long string consisting of 10^6 tokens (for simplicity, a token is a space-separated word, so the string is split into a list of tokens).
Now I need to find all duplicated sequences and the start locations of each duplication in the string. For example:
(The brackets are not really in the string; they are only there to clarify the locations.)
this[0] string[1] is[2] test[3] to[4] check[5] duplication[6]
test[7] to[8] check[9] duplication[10] this[11] string[12]
==> at 0,11 - 2 tokens duplication
==> at 3,7 - 4 tokens duplication
I've tried to build a Python program with an algorithm based on a dictionary that keeps a list of each token's indexes and checks for token matches starting from those indexes. That is far too slow, even when I used NumPy instead of lists.
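Roughly, what I tried looks like the sketch below (simplified; the real code uses NumPy arrays). It indexes every position of every token, then extends a match token by token from each pair of positions, keeping only left-maximal matches so sub-runs of a longer match are not reported separately. With many repeated tokens the pairwise loop is quadratic, which is why it is so slow:

```python
from collections import defaultdict

def find_duplicates(tokens, min_len=2):
    # Index every position at which each token occurs.
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)

    results = []  # (first_pos, second_pos, length)
    for tok, idxs in positions.items():
        for a_i, a in enumerate(idxs):
            for b in idxs[a_i + 1:]:
                # Skip matches that are contained in a longer match
                # starting one token earlier (keep left-maximal only).
                if a > 0 and b > 0 and tokens[a - 1] == tokens[b - 1]:
                    continue
                # Extend the match token by token.
                length = 0
                while b + length < len(tokens) and tokens[a + length] == tokens[b + length]:
                    length += 1
                if length >= min_len:
                    results.append((a, b, length))
    return results
```

On the example above this returns `[(0, 11, 2), (3, 7, 4)]`, matching the expected output, but the nested loops over position pairs blow up on 10^6 tokens.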
Then I tried to use a suffix tree. But all the methods I found work on letters rather than words. When I think about converting such an algorithm to use tokens instead of letters, it could work if I had many small strings. The problem is that I have one huge string, so it creates one long tree.
All the answers on Stack Overflow and all over the internet don't consider one long string. Any ideas for the best CPU-performance algorithm? (RAM usage is less important.) Thanks
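For what it's worth, the letter-based methods seem to carry over to words if each distinct token is first mapped to an integer ID, so the "alphabet" becomes the vocabulary. A minimal sketch of that idea using a suffix array instead of a tree (the naive suffix sort here is only for illustration; for 10^6 tokens it would need an O(n log n) prefix-doubling or linear-time construction, and the LCPs would come from Kasai's algorithm rather than rescanning):

```python
def token_ids(tokens):
    # Map each distinct token to a small integer so that
    # character-based suffix structures apply unchanged to words.
    vocab = {}
    return [vocab.setdefault(t, len(vocab)) for t in tokens]

def duplicated_sequences(tokens, min_len=2):
    ids = token_ids(tokens)
    n = len(ids)
    # Naive suffix array: sort suffix start positions lexicographically.
    # (Quadratic in the worst case -- replace with prefix doubling
    # or a linear-time construction for real input sizes.)
    sa = sorted(range(n), key=lambda i: ids[i:])
    results = []
    for a, b in zip(sa, sa[1:]):
        # The LCP of adjacent suffixes in the suffix array is the
        # length of a repeated token sequence.
        k = 0
        while a + k < n and b + k < n and ids[a + k] == ids[b + k]:
            k += 1
        if k >= min_len:
            results.append((min(a, b), max(a, b), k))
    return results
```

Note this reports every adjacent-suffix repeat, including sub-runs of longer matches (e.g. it finds `(4, 8, 3)` inside the 4-token match at `(3, 7)`), so a maximality filter would have to be layered on top.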