Comparing a text against many others to find texts with matching sentences

Question

I want to compare a passage against many other passages (thousands or more) and see whether any part of those passages is used verbatim in the first one.

Imagine you have a passage named A, and you want to check whether it contains any sentence, or part of a sentence, from the thousands of other passages.

I thought of a very inefficient way, and nothing better comes to mind. My approach is to read the first three words of the input passage (A) and check whether an exact match exists anywhere in the database of thousands of texts. If there are matches, list them, then add the fourth word to the string and look for the 4-word string among the texts that matched the 3-word string. Repeat until the n-word string has no more matches. The list of texts matching the (n-1)-word string is saved as the result of this run. The next 3-word string then consists of the nth, (n+1)th and (n+2)th words, and everything starts again until the document ends.
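
To make the procedure concrete, here is a minimal Python sketch of the approach described above. The names `contains_run` and `naive_matches` are made up for illustration, and the sketch assumes whitespace tokenization, lowercase comparison, and that a 3-word window with no match is simply advanced by one word (the description does not cover that case).

```python
def contains_run(doc_words, run):
    """True if the word list `run` occurs contiguously in `doc_words`."""
    n = len(run)
    return any(doc_words[j:j + n] == run
               for j in range(len(doc_words) - n + 1))


def naive_matches(passage, corpus, min_words=3):
    """Greedy extension: start with min_words words, grow while texts still match.

    `corpus` maps a text id to its text. Returns (phrase, matching ids) pairs.
    """
    words = passage.lower().split()
    docs = {doc_id: text.lower().split() for doc_id, text in corpus.items()}
    results = []
    i = 0
    while i + min_words <= len(words):
        n = min_words
        hits = [d for d, w in docs.items() if contains_run(w, words[i:i + n])]
        if not hits:
            i += 1  # assumption: slide the window by one word on a miss
            continue
        # Add one word at a time, searching only the texts that still match.
        while i + n < len(words):
            longer = [d for d in hits
                      if contains_run(docs[d], words[i:i + n + 1])]
            if not longer:
                break
            hits, n = longer, n + 1
        results.append((" ".join(words[i:i + n]), sorted(hits)))
        i += n  # the next window starts right after the matched run
    return results
```

Every 3-word window triggers a scan of every text, which is the cost that makes this impractical for a large database.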

This would be very inefficient for a large input text and a huge database of texts to compare against. Is there a better algorithm?

Are there any requirements, such as index space limitations? E.g., you could index groups of words (e.g. every 3 consecutive words) in the reference texts, and then search for those to cut down the number of candidate texts that need a full-text comparison. Also, MySQL has a full-text search facility, which you could use as a preliminary filter. — pscs, Feb 16 '14 at 23:17
@pscs, I'm not sure how efficient the indexing you mentioned would be. Have you tried something similar before? From my understanding of full-text search (which may be wrong), it may not be helpful. — SAVAFA, Feb 16 '14 at 23:37
@SAVAFA - we have used grouped word indexes, and they speed up searches considerably, because lookups on word groups can use an index, whereas normal LIKE-style searches in text cannot. The problem is that the indexes can become very large, which is why I asked the question. I haven't used MySQL full-text search, but with other database engines, using a DB FTS index is much quicker than using normal substring searches. It isn't as accurate, which is why you'd then do another substring search afterwards. — pscs, Feb 16 '14 at 23:47
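
The grouped word index suggested in the comments could be sketched roughly as follows. This is only an illustration under assumptions, not the commenter's actual implementation: the function names are hypothetical, the index is an in-memory dictionary rather than a database table, and words are split on whitespace and lowercased.

```python
from collections import defaultdict


def shingles(words, k=3):
    """All runs of k consecutive words, as tuples."""
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]


def build_index(corpus, k=3):
    """Map each k-word group to the ids of the reference texts containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for gram in shingles(text.lower().split(), k):
            index[gram].add(doc_id)
    return index


def candidates(passage, index, k=3):
    """Map each candidate text id to the passage word offsets where a
    k-word group of the passage also occurs in that text."""
    found = defaultdict(list)
    for pos, gram in enumerate(shingles(passage.lower().split(), k)):
        for doc_id in index.get(gram, ()):
            found[doc_id].append(pos)
    return dict(found)
```

The index is built once over the reference texts. Each 3-word group of the input passage is then a single dictionary lookup, and only the returned candidate texts need the exact comparison afterwards. As noted in the comments, the trade-off is index size, since every word position in every reference text contributes an entry.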

0 Answers