Ruby: How does one test for similarity between two blocks of text?

Question

So, lets say I have these texts:

Text1:

absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients.

Text2:

zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate

Text 3

When the zerg first arrived in the Koprulu sector, they were unified by their absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate the advanced protoss race, it found useful but undeveloped material in humanity.

Now, The end of Text1 and the beginning of text2 overlap, so we'd say the text blocks aren't unique. Similarly, with Text3, Text1 can be found inside (as well as Text2) so this is also not unique, due to the overlap.

So, my question:

How do I go about writing something that can look at consecutive letters or words and determine uniqueness? Ideally, I'd want such a method to return some value, representing the amount of similarity--maybe the number of matched words over the average of the two text blocks' size. When it returns 0, both texts tested should be completely unique.

Some problem's I've run into when playing around with Ruby's string methods.

First, I started trying to find the intersection of two strings.

>> a = "nt version, there are no ch"  
>> b = "he current versi"  
>> (a.chars.to_a & b.chars.to_a).join  
=> "nt versihc"

problem with the above method is that it just appends letters that are in common to the end of the result (we lose the order of characters), which would make it hard to test uniqueness. But I don't think intersection is the best way to start this similarity comparison. Any number of combinations of words could be present in both texts that are being compared. So maybe if I made an array of consecutive similarities... but that would require us to traverse one of the texts for as many times as we try phrase lengths.

I guess I really just don't know where to start, and in a way that is efficient and not O(n^too_high).

score 3 · Answer 1 · answered Oct 07 '11 at 20:01

3

Here's a Ruby implementation of the Levenshtein distance algorithm. After installing the gem, you can use it like that:

require 'rubygems'
require 'Text'

t1 = "absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients."

t2 = "zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate"

puts Text::Levenshtein.distance(t1,t2)

answered Oct 07 '11 at 20:01

agregoire

2,022
1
16
15

I feel like this is done on a per-letter basis, rather than consecutive words. =\ idk. Like... I compared the big paragraph after my code block (in the question above) and the big paragraph just before the code block. I did the Text::Levenshtein.distance and divided by the the average length of the two paragraph lengths. the difference percentage i got (the closer to 1, the more different; 0, more similar) was almost the same for the example you gave, as it was for my comparison between my two paragraphs above (starting with 'problem with', and 'how do I go') – NullVoxPopuli Oct 07 '11 at 20:15
1

The "consecutive words" is called an algorithm called 'Longest Common Substring'. – Hock Oct 11 '11 at 23:51

score 3 · Accepted Answer · answered Oct 07 '11 at 20:14

I believe you're looking for is the Longest Common Substring problem, i.e. the problem of finding, given two strings, of the longest substring they both have in common. The link is to the Wikipedia page which will help you understand the domain and contains a pseudocode example of an algorithm that runs in O(nm) time.

Further, Wikibooks' Algorithm Implementation book has an implementation in Ruby. It includes an lcs_size method that may be all you need. In short, if lcs_size(text1, text2) returns 4 that means text1 and text2 have very little consecutive text in common, probably just a single word, but if it returns, say, 40, they might have an entire sentence in common.

Hope that's helpful!

that is helpful! thanks! This might be just what I need to get started =D — NullVoxPopuli, Oct 07 '11 at 20:23

score 2 · Answer 3 · answered Dec 09 '12 at 13:13

2

The amatch gem is perfect for string comparisons.

answered Dec 09 '12 at 13:13

Zuhaib Ali

3,344
3
20
32

score 2 · Answer 4 · answered Oct 07 '11 at 19:52

2

Your problem is not Ruby. It's the algorithm. You could split each text into words, then run a minimum distance algorithm (http://en.wikipedia.org/wiki/Levenshtein_distance) to get that.

The smaller the number, the more similar the texts are.

answered Oct 07 '11 at 19:52

Hock

5,784
2
19
25

score 2 · Answer 5 · answered Oct 07 '11 at 21:28

This could be improved a lot but it's an idea:

txt1 = "absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients."
txt2 = "zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate"

def txt_to_ary(txt)
    txt.gsub(/\.|,/, ' ').downcase.split(/\s+/)
end

def longest_match(txt1, txt2)
    longest = 0
    txt1.each_with_index do |w1, i|
        txt2.each_with_index do |w2, j|
            next unless w1 == w2
            k = 0
            k += 1 while txt1[i+k] == txt2[j+k]
            longest = k if k > longest          
        end
    end
    longest
end

txt1 = txt_to_ary( txt1 )
txt2 = txt_to_ary( txt2 )

puts longest_match(txt1, txt2) #=>12

this line: k += 1 while txt1[i+k] == txt2[j+k] never ends when txt1.length - i == txt2.length -j — wangii, Aug 20 '13 at 21:22

Ruby: How does one test for similarity between two blocks of text?

Text1:

Text2:

Text 3

5 Answers5