I want to compare a passage against many other passages (thousands or even more) and find out whether any part of those passages is used verbatim in the first one.
Imagine you have a passage named A that you want to check to see whether it contains any sentence, or part of a sentence, from the other thousands of passages.
I thought of a very inefficient approach, and nothing better comes to mind. My approach is to read the first three words of the input passage (A), then check whether an exact match exists anywhere in the database of all the thousands of texts. If there are matches, list them, then append the fourth word to the string and look for matches of the 4-word string among the texts that matched the 3-word string. Repeat this until the n-word string has no more matches. The list of (n-1)-word matches is saved as the result of this run. The next 3-word string then consists of the nth, (n+1)th and (n+2)th words, and everything starts again until the document ends.
This would be very inefficient for a large input text and a huge database of texts to compare against. Is there a better algorithm?
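
To make the procedure concrete, here is a minimal Python sketch of the naive approach described above. It is only an illustration under some assumptions that the description leaves open: the names `contains_phrase` and `find_shared_phrases` are made up for this example, words are obtained by splitting on whitespace (so case and punctuation must match exactly), the "database" is just a list of strings in memory, and when a 3-word string has no match at all the window moves forward by one word.

```python
def contains_phrase(doc_words, phrase):
    """Return True if `phrase` (a list of words) occurs contiguously in `doc_words`."""
    n = len(phrase)
    return any(doc_words[i:i + n] == phrase
               for i in range(len(doc_words) - n + 1))


def find_shared_phrases(passage, corpus, min_words=3):
    """Naive search: grow each match one word at a time, as described above.

    Returns a list of (phrase, [indices of corpus texts containing it]).
    """
    words = passage.split()                  # assumption: whitespace tokenization
    docs = [text.split() for text in corpus]
    results = []
    start = 0
    while start + min_words <= len(words):
        length = min_words
        seed = words[start:start + length]
        candidates = [i for i, d in enumerate(docs) if contains_phrase(d, seed)]
        if not candidates:
            start += 1                       # assumption: slide by one word on a miss
            continue
        # Add one word at a time, searching only the texts that still match.
        while start + length < len(words):
            longer = words[start:start + length + 1]
            narrowed = [i for i in candidates if contains_phrase(docs[i], longer)]
            if not narrowed:
                break                        # the n-word string has no match
            candidates = narrowed
            length += 1
        # Save the (n-1)-word match and restart at the word that failed.
        results.append((" ".join(words[start:start + length]), candidates))
        start += length
    return results


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog today"
    others = ["a quick brown fox appeared",
              "jumps over the lazy cat",
              "nothing in common"]
    for phrase, where in find_shared_phrases(a, others):
        print(phrase, where)
```

With the sample data this reports "quick brown fox" in text 0 and "jumps over the lazy" in text 1. Every seed lookup scans every text in full, which is exactly the cost I would like to avoid.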