7

I was trying out Python's difflib module and came across SequenceMatcher. I tried the following examples but couldn't understand the results.

>>> SequenceMatcher(None,"abc","a").ratio()
0.5

>>> SequenceMatcher(None,"aabc","a").ratio()
0.4

>>> SequenceMatcher(None,"aabc","aa").ratio()
0.6666666666666666

Now, according to the documentation for ratio():

Return a measure of the sequences' similarity as a float in the range [0, 1]. Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T.

So, for my cases:

  1. T=4 and M=1 so ratio 2*1/4 = 0.5
  2. T=5 and M=2 so ratio 2*2/5 = 0.8
  3. T=6 and M=1 so ratio 2*1/6.0 = 0.33

According to my understanding, T = len("aabc") + len("a") and M = 2, because a occurs twice in aabc.

So where am I going wrong? What am I missing?

Here is the source code of SequenceMatcher.ratio()

Bakuriu
RanRag

2 Answers

5

You've got the first case right. In the second case, only one a from aabc can match, because the second string contains just one a, so M = 1. In the third example, both a's match, so M = 2.

[P.S.: you're referring to the ancient Python 2.4 source code. The current source code is at hg.python.org.]
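To check the match counts above, here is a minimal sketch that asks SequenceMatcher which pieces it actually paired up, using get_matching_blocks(). The helper name matched_pieces is made up for this illustration and is not part of difflib.

```python
from difflib import SequenceMatcher

def matched_pieces(a, b):
    """Return the substrings of `a` that SequenceMatcher pairs with `b`.

    Hypothetical helper for illustration; not part of difflib.
    """
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    # The last block is always a zero-size sentinel, so skip empty blocks.
    return [a[blk.a:blk.a + blk.size] for blk in blocks if blk.size]

for a, b in [("abc", "a"), ("aabc", "a"), ("aabc", "aa")]:
    pieces = matched_pieces(a, b)
    m = sum(len(p) for p in pieces)  # M: number of matched characters
    t = len(a) + len(b)              # T: total number of characters
    print(repr(a), repr(b), pieces, "M =", m, "T =", t, "ratio =", 2.0 * m / t)
```

Summing the block sizes this way also gives M directly, without working backwards from ratio().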

Fred Foo
  • But why does only one `a` from `aabc` match? I believe it should match both `a`s. In the third example both `a`s match, but is `aa` matched to `aa` as a whole, or is it first matched to one `a` and then to the next `a`, making M=2? It is still not clear to me. – RanRag Sep 15 '12 at 11:20
  • @Noob: because in the second case, there's only one `a` in the second string to match with, while in the third there are two, so only the `bc` part matches nothing. Try matching `a` with `aa`. – Fred Foo Sep 15 '12 at 11:40
  • Matches actually stands for _character_ matches. So the string "a" can only do one match, because it has only a single character that can create a match. "aa" has two characters, and thus provides two matches. – Bakuriu Sep 15 '12 at 11:58
  • Hey is there any way to get the number of matches? – Mohsin May 12 '17 at 12:59
  • Much too late, but still: you could do `T=len(a)+len(b)` `M=T*ratio()/2`. – Rorschach May 21 '21 at 07:47
0

never too late...

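from difflib import SequenceMatcher

texto1 = 'BRASILIA~DISTRITO FEDERAL, DF'
texto2 = 'BRASILIA-DISTRITO FEDERAL, '

# T: total number of characters in both strings
tamanho_tot = len(texto1) + len(texto2)

# Walk over the shorter string. Compare lengths here: `texto1 <= texto2`
# would compare the strings alphabetically, not by size.
if len(texto1) <= len(texto2):
    shorter, longer = texto1, texto2
else:
    shorter, longer = texto2, texto1

# Count the characters of the shorter string that occur anywhere in the
# longer one. This ignores character order, so it is only an approximation
# of SequenceMatcher's M that happens to agree for these two inputs.
tot = sum(1 for y in shorter if y in longer)

print('sequenceM =', SequenceMatcher(None, texto1, texto2).ratio())
print('Total calculado =', 2 * tot / tamanho_tot)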

sequenceM = 0.9285714285714286

Total calculado = 0.9285714285714286
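The two numbers agree for this pair of strings, but the hand-rolled count only checks whether each character occurs somewhere in the other string, while SequenceMatcher matches contiguous blocks in order. The sketch below compares both on a couple of short inputs; membership_ratio is a hypothetical name for the counting approach above and is not a difflib function.

```python
from difflib import SequenceMatcher

def membership_ratio(a, b):
    """Order-insensitive estimate of the ratio (hypothetical helper).

    Counts characters of the shorter string that appear anywhere in the
    longer one, then applies 2*M/T.
    """
    shorter, longer = sorted((a, b), key=len)
    hits = sum(1 for ch in shorter if ch in longer)
    return 2.0 * hits / (len(a) + len(b))

for a, b in [("abc", "abd"), ("ab", "ba")]:
    print(repr(a), repr(b),
          "membership =", membership_ratio(a, b),
          "SequenceMatcher =", SequenceMatcher(None, a, b).ratio())
```

For "ab" against "ba" the membership count reports a perfect score, while SequenceMatcher can only align one of the two characters, so the shortcut should be treated as a rough check rather than a reimplementation of ratio().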
