
Let's say I have a really long string consisting of 10^6 tokens (for simplicity, a token is a space-separated word, so this string is split into a list of tokens).

Now I need to find all duplicated sequences and the start locations of each duplication in the string. For example:

(The brackets are not really in the string; they are only there to clarify the locations.)

this[0] string[1] is[2] test[3] to[4] check[5] duplication[6]
test[7] to[8] check[9] duplication[10] this[11] string[12]

==> at 0,11 - 2 tokens duplication
==> at 3,7 - 4 tokens duplication

I've tried to build a Python program with an algorithm based on a dictionary that keeps a list of each token's indexes and checks for token matches starting from those indexes. That is far too slow, even when I used NumPy instead of lists.

Then I tried to use a suffix tree. But all methods tend to use letters rather than words. Converting this algorithm to use tokens instead of letters could work if I had many small strings. The problem is that I have one huge string, so it creates one long tree.

All the answers on Stack Overflow and all over the internet don't consider one long string. Any ideas for the best CPU-performance algorithm? (RAM usage is less important.) Thanks

Izik
  • I meant it will be one very long branch rather than an actual tree. Because I don't compare different strings, the trie method doesn't seem to help. Unless I don't understand something in the algorithm – Izik Oct 25 '22 at 22:43
  • As for the typo, it's not one; it means it found 4 consecutive matches ("test to check duplication") – Izik Oct 25 '22 at 22:45
  • Thanks for the clarification; the question sounds like the word-token version of "find all repeated non-overlapping substrings". Since there are existing algorithms for the letter-token version (trie, Rabin-Karp, etc.), my idea is to overload their letter-compare ops with word-compare ops. – Xin Cheng Oct 26 '22 at 04:25

2 Answers


You wish to identify repeated bi-grams.

Optionally, construct a dictionary for converting each str token to an int.

Iterate over the document, generating the bi-gram at the current position, then advancing to the next position. Store these in a bigram_to_index_list dict in memory, or perhaps in an out-of-core file or database table. A defaultdict(list) will prove convenient for the in-memory solution.

Now iterate over all entries that have multiple index positions for a given bigram. Probe the original string to see whether each match can be extended to a tri-gram or longer, and output such results.
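A minimal in-memory sketch of the steps above (the function name and return shape are my own; note it still reports inner duplications such as "to check duplication", which would need the separate filtering step the asker mentions):

```python
from collections import defaultdict

def find_duplicates(tokens):
    """Map (start_a, start_b) pairs to the length of the longest
    duplicated token run beginning at both positions."""
    # Step 1: index every bigram by the positions where it starts.
    bigram_positions = defaultdict(list)
    for i in range(len(tokens) - 1):
        bigram_positions[(tokens[i], tokens[i + 1])].append(i)

    # Step 2: for bigrams seen more than once, extend each pair of
    # occurrences as far as the tokens keep matching.
    results = {}
    for positions in bigram_positions.values():
        for x in range(len(positions)):
            for y in range(x + 1, len(positions)):
                a, b = positions[x], positions[y]
                n = 2
                while b + n < len(tokens) and tokens[a + n] == tokens[b + n]:
                    n += 1
                results[(a, b)] = n
    return results

tokens = ("this string is test to check duplication "
          "test to check duplication this string").split()
for (a, b), n in sorted(find_duplicates(tokens).items()):
    print(f"at {a},{b} - {n} tokens duplication")
# at 0,11 - 2 tokens duplication
# at 3,7 - 4 tokens duplication
# at 4,8 - 3 tokens duplication
# at 5,9 - 2 tokens duplication
```

The extension loop is what makes bigrams pay off: only positions that already share two tokens are ever compared further.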

J_H
  • Sorry if I misunderstand, but is it much different from what I've tried? I also used a dictionary to store the token indexes (I also converted strings to numbers and used defaultdict). I'm trying to find a whole new algorithm, probably using a tree, because even with some improvements the dictionary takes too much time – Izik Oct 25 '22 at 22:41
  • I understood your approach to be built on unigrams. Some unigrams have high entropy and are highly selective, like "sesquipedalian", but others have low entropy, like "the", and that slows down the search for dups. Your n=2 and n=4 examples made me believe we need at least a bigram for a "duplicated sequence". And you said we can use lots of storage. So storing bigrams, and hashing / sorting on that, offers much better selectivity, and a good starting point for identifying a longer trigram / quadgram at the same start index. Post the corpus and the code you run if detailed timings are needed. – J_H Oct 25 '22 at 23:07

@Izik Since I'm a new contributor I can't add a comment! As suggested in @J_H's comment, the only alternative to drastically reduce the search time is the hashing technique. Here is a snippet coded in Java that works for a short token list. (Python's built-in dict is the equivalent of the HashMap class.)

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;

    public class DuplicateFinder {

        static String[] tokens = new String[]{"this", "string", "is", "test", "to",
                "check", "duplication", "test", "to", "check", "duplication", "this",
                "string", "this", "string", "is", "test", "to", "check", "duplication",
                "test", "to", "check"};
        static List<tp> tp_list = new ArrayList<>();
        static HashMap<String, Integer> token_map = new HashMap<>();

        // A token together with the list of positions where it occurs.
        static class tp {

            String t;
            List<Integer> poslist;

            tp(String tok, int pos) {
                this.t = tok;
                this.poslist = new ArrayList<>();
                this.poslist.add(pos);
            }
        }

        static void createDuplicateLists() {
            tp_list.add(new tp(tokens[0], 0));
            int j = 0;
            token_map.clear();
            token_map.put(tokens[0], j);
            for (int i = 1; i < tokens.length; i++) {
                String tok = tokens[i];
                if (token_map.containsKey(tok)) {
                    // Seen before: append this position to its list.
                    tp_list.get(token_map.get(tok)).poslist.add(i);
                } else {
                    tp_list.add(new tp(tok, i));
                    j++;
                    token_map.put(tok, j);
                }
            }
        }

        static void printLists() {
            for (tp tkp : tp_list) {
                System.out.println(tkp.t + ":" + tkp.poslist);
            }
        }

        public static void main(String[] args) {
            createDuplicateLists();
            printLists();
        }
    }

/*
Printed Lists:

this:[0, 11, 13]
string:[1, 12, 14]
is:[2, 15]
test:[3, 7, 16, 20]
to:[4, 8, 17, 21]
check:[5, 9, 18, 22]
duplication:[6, 10, 19]
*/
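For reference, here is a sketch of the same token-to-positions indexing in Python, where a defaultdict plays the role of the HashMap / tp bookkeeping above (this is my translation, not part of the original answer):

```python
from collections import defaultdict

tokens = ("this string is test to check duplication test to check "
          "duplication this string this string is test to check "
          "duplication test to check").split()

# Map each token to the list of positions where it occurs.
positions = defaultdict(list)
for i, tok in enumerate(tokens):
    positions[tok].append(i)

for tok, poslist in positions.items():
    print(f"{tok}:{poslist}")
# this:[0, 11, 13]
# string:[1, 12, 14]
# is:[2, 15]
# test:[3, 7, 16, 20]
# to:[4, 8, 17, 21]
# check:[5, 9, 18, 22]
# duplication:[6, 10, 19]
```

Since Python 3.7 dicts preserve insertion order, so the output matches the Java version's printed lists.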
SudoKoach
  • Hi, thanks for your effort, but according to my understanding, this code finds duplicate words (tokens). What I'm trying to do is find duplicate *sequences* of words. E.g. in my example, "this string" is a duplication of 2 tokens and "test to check duplication" is a duplication of 4 tokens – Izik Nov 08 '22 at 15:18
  • Hi, thanks for your clarifying comment. I will update the code. – SudoKoach Nov 09 '22 at 11:13
  • @Izik In your original question you've written ... so this string is splitted to list of tokens ... – SudoKoach Nov 09 '22 at 12:16
  • @Izik In your original question you've written ... so this string is splitted to list of tokens). I have a clarifying question: what is the separator of the different "lists of tokens" or sequences of words you mentioned in your comment above? – SudoKoach Nov 09 '22 at 13:38
  • I receive a (very) long string. For the solution, somewhere along the way this string is indeed split into a list of tokens. A token can be defined in many ways, but for simplicity, as mentioned in my question, a token is any space-separated word. – Izik Nov 09 '22 at 16:09
  • @Izik I think I understood your problem. The code I've written applies to 1-token duplicates. In the 4-token duplicate "test[3] to[4] check[5] duplication[6]", do you consider the 2-token and 3-token sub-duplicates, like "test[3] to[4]" and "to[4] check[5] duplication[6]", as different duplicates? If not, how do you know they are parts of a 4-token duplicate? – SudoKoach Nov 09 '22 at 17:01
  • Inner duplications need to be disregarded. In my example, only "this string" and "test to check duplication" are considered duplicated sequences. The sequence "to check duplication" is already included inside "test to check duplication", therefore it is disregarded. Currently, in my program, I have my own algorithm to disregard inner sequence duplications, so this is less of a problem. – Izik Nov 09 '22 at 18:13