So, let's say I have these texts:
Text1:
absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients.
Text2:
zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate
Text3:
When the zerg first arrived in the Koprulu sector, they were unified by their absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate the advanced protoss race, it found useful but undeveloped material in humanity.
Now, the end of Text1 and the beginning of Text2 overlap, so we'd say the text blocks aren't unique. Similarly, both Text1 and Text2 can be found inside Text3, so it isn't unique either.
So, my question:
How do I go about writing something that can look at consecutive letters or words and determine uniqueness? Ideally, I'd want such a method to return some value, representing the amount of similarity--maybe the number of matched words over the average of the two text blocks' size. When it returns 0, both texts tested should be completely unique.
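To pin down the kind of value I mean, here is a rough sketch. It assumes "matched words" means the longest run of consecutive words the two texts share, and that words are compared case-insensitively with punctuation stripped. The names `longest_common_run` and `overlap_score` are made up for illustration. It is the naive O(n*m) comparison, which is the cost I'd like to avoid on large inputs.

```ruby
# Hypothetical helper: length (in words) of the longest run of consecutive
# words that appears in both word arrays. Plain O(n*m) dynamic programming,
# keeping only the previous row of the table.
def longest_common_run(words_a, words_b)
  best = 0
  prev = Array.new(words_b.size + 1, 0)
  words_a.each do |wa|
    curr = Array.new(words_b.size + 1, 0)
    words_b.each_with_index do |wb, j|
      next unless wa == wb
      # Extend the run that ended at the previous word of both texts.
      curr[j + 1] = prev[j] + 1
      best = curr[j + 1] if curr[j + 1] > best
    end
    prev = curr
  end
  best
end

# Hypothetical score: matched words over the average size of the two blocks.
# Assumption: lowercase the text and treat runs of \w characters as words.
def overlap_score(text_a, text_b)
  a = text_a.downcase.scan(/\w+/)
  b = text_b.downcase.scan(/\w+/)
  return 0.0 if a.empty? || b.empty?
  longest_common_run(a, b) / ((a.size + b.size) / 2.0)
end

overlap_score("the quick brown fox", "brown fox jumps high") # => 0.5
overlap_score("alpha beta", "gamma delta")                   # => 0.0
```

One weakness of this sketch is that a single shared word such as "the" already gives a nonzero score, so a minimum run length would probably be needed before calling two texts non-unique.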
Some problems I've run into when playing around with Ruby's string methods:
First, I started by trying to find the intersection of two strings:
>> a = "nt version, there are no ch"
>> b = "he current versi"
>> (a.chars.to_a & b.chars.to_a).join
=> "nt versihc"
The problem with the above method is that it treats each string as a set of characters: it returns every character the two strings have in common, with repeats dropped, so we lose which characters were adjacent to each other, and that makes it hard to test uniqueness. I don't think intersection is the best way to start this similarity comparison anyway, since any number of combinations of words could be present in both of the texts being compared. So maybe I could build an array of consecutive matches... but that would require traversing one of the texts once for every phrase length we try.
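A fixed-phrase-length version of the "consecutive matches" idea might look like the sketch below. It assumes a phrase is a run of `n` consecutive words (3 by default, an arbitrary choice), and the names `shingles` and `shingle_overlap` are made up for illustration. With `n` fixed, each text is traversed once, but I don't know whether a single phrase length is good enough.

```ruby
require "set"

# Hypothetical helper: the set of every run of n consecutive words in a text.
# Assumption: lowercase the text and treat runs of \w characters as words.
def shingles(text, n)
  text.downcase.scan(/\w+/).each_cons(n).map { |run| run.join(" ") }.to_set
end

# Hypothetical score: number of shared n-word runs over the average number of
# runs in the two texts. 0.0 means no run of n consecutive words is shared.
def shingle_overlap(text_a, text_b, n = 3)
  a = shingles(text_a, n)
  b = shingles(text_b, n)
  return 0.0 if a.empty? || b.empty?
  (a & b).size / ((a.size + b.size) / 2.0)
end

shingle_overlap("a b c d", "a b c d", 2)         # => 1.0
shingle_overlap("one two three", "four five six") # => 0.0
```

Here the ordering problem from the character intersection goes away, because each element of the set is itself an ordered run of words rather than a single character.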
I guess I really just don't know where to start in a way that is efficient and not O(n^too_high).