
So, let's say I have these lines of code:

x = 'My name is James Bond'
y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'
from difflib import SequenceMatcher as sm
sm(None, x, y).ratio()

Now, the ratio being returned is 0.47191011235955055, which is fair.
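For context, that number follows from how difflib defines the ratio: 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. A quick check using only the strings above (the names `matched` and `total` are just for illustration):

```python
from difflib import SequenceMatcher

x = 'My name is James Bond'
y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'

matcher = SequenceMatcher(None, x, y)

# M: total size of all matching blocks (the final block is a size-0 sentinel)
matched = sum(block.size for block in matcher.get_matching_blocks())

# T: combined length of both strings
total = len(x) + len(y)

print(matched, total)         # 21 89
print(2.0 * matched / total)  # same value as matcher.ratio()
```

All 21 characters of x are matched, but the extra 47 characters of y inflate T, which is what pulls the score below 0.5.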

My problem is that x is present in its entirety in y, so I was hoping to get a fairly high match. Looking at it another way, I am basically looking for some sort of plagiarism detection.

UPDATE: To be more specific: in the above example I'd expect a match of 100%, since x is present in y in its entirety. However, it may not be as clear-cut in every example.

Another example:

x = "My name is James Herbert Bond"

Here x has an extra word, so a suitable matching method would give me a lower match percentage (say 90%), since the only difference is the one extra word "Herbert" in x that is not present in y.

Pankaj Singh
  • Can you be more specific on what your desired output is? "hoping to get a fairly high match" is not very descriptive of what you want to do. – enumaris Jul 22 '19 at 20:55
  • If you search in your browser for "Python plagiarism detection", you'll find references that can explain this much better than we can manage here. – Prune Jul 22 '19 at 20:56
  • Please follow the posting guidelines in the help documentation, as suggested when you created this account. [On topic](https://stackoverflow.com/help/on-topic), [how to ask](https://stackoverflow.com/help/how-to-ask), and ... [the perfect question](https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) apply here. – Prune Jul 22 '19 at 20:57
  • The question in the title is as simple as `x in y` which evaluates to `True`. But otherwise, you may need to look into something like Jellyfish, which has several different distance metrics, some of which can be weighted to give a higher match if the strings start the same way – G. Anderson Jul 22 '19 at 21:05
  • @enumaris, Prune: a "fairly high match" is what I am looking for. I don't know if x in y should be a 90% match or a 70% match. I haven't tried G. Anderson's solution yet, but Sunitha's suggestion is something I'd be keen to implement. – Pankaj Singh Jul 23 '19 at 14:49

2 Answers


I would suggest looking into the `partial_ratio` method in the fuzzywuzzy module.

>>> x = 'My name is James Bond'
>>> y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'
>>> 
>>> from fuzzywuzzy import fuzz
>>> fuzz.partial_ratio(x, y)
100
>>> 
>>> x = "My name is James Herbert Bond"
>>> fuzz.partial_ratio(x, y)
72
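To give a feel for what a partial match does, here is a rough standard-library sketch: it compares the shorter string against every same-length window of the longer one and keeps the best score. This is a simplified illustration, not fuzzywuzzy's actual implementation, and `sliding_partial_ratio` is a hypothetical name, so its scores will not necessarily equal those of `fuzz.partial_ratio`.

```python
from difflib import SequenceMatcher


def sliding_partial_ratio(a, b):
    """Best SequenceMatcher ratio of the shorter string against any
    same-length window of the longer string (brute-force sketch)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return best


x = 'My name is James Bond'
y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'

print(sliding_partial_ratio(x, y))  # 1.0, x equals the first window of y
print(sliding_partial_ratio('My name is James Herbert Bond', y))
```

The brute-force loop is quadratic in the worst case, so it is only meant to show the idea on short strings.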
Sunitha

Sum lengths of non-overlapping matching subsequences and divide by the length of the first sequence.

from difflib import SequenceMatcher
x = 'My name is James Bond'
y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'
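blocks = SequenceMatcher(None, x, y).get_matching_blocks()
# Share of x covered by matching blocks (the final block is a size-0 sentinel)
ratio = sum(block.size for block in blocks) / len(x)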
print(ratio)

This prints 1.0, because every character of x is covered by a matching block in y.
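The same idea can be wrapped in a small helper and tried on the second example from the question. This is only a sketch: `containment_ratio` is a hypothetical name, and passing `autojunk=False` is my assumption that longer texts (200+ characters, where difflib's automatic junk heuristic kicks in) should not have frequent characters ignored.

```python
from difflib import SequenceMatcher


def containment_ratio(needle, haystack):
    """Fraction of `needle` covered by matching blocks found in `haystack`."""
    if not needle:
        return 0.0
    # autojunk=False: do not let difflib treat very frequent characters
    # in a long haystack as junk (assumption, see the note above)
    matcher = SequenceMatcher(None, needle, haystack, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(needle)


y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'

print(containment_ratio('My name is James Bond', y))          # 1.0
print(containment_ratio('My name is James Herbert Bond', y))  # below 1.0
```

Note that the blocks may be scattered anywhere in y as long as they appear in order, so a high score means x can be assembled from pieces of y, not necessarily that x occurs as one contiguous run.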

Lukas