How to match a text contained in one variable to another

Question

So, let's say I have this code:

x = 'My name is James Bond'
y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'
from difflib import SequenceMatcher as sm
sm(None, x, y).ratio()

Now, the ratio being returned is 0.47191011235955055, which is fair.
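
For context, the figure can be reproduced by hand. The documented formula for `ratio()` is 2.0*M / T, where M is the number of matched characters and T is the combined length of both strings. A minimal check, reusing the strings from the question:

```python
from difflib import SequenceMatcher

x = 'My name is James Bond'
y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'

matcher = SequenceMatcher(None, x, y)

# M: total size of the matching blocks (the final block is a zero-length sentinel)
matched = sum(block.size for block in matcher.get_matching_blocks())
# T: combined length of both strings
total = len(x) + len(y)

print(matched, total)          # 21 89
print(2.0 * matched / total)   # same value as matcher.ratio()
```

All 21 characters of x are matched, but the 47 unmatched characters of y still count toward T, which is why the score stays below 0.5.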

My problem is that x is present in its entirety in y, so I was hoping to get a fairly high match. Put another way, I am looking for some sort of plagiarism detection.

UPDATE: To be more specific, in the above example I expected a match of 100%, since x is present in y in its entirety. However, not every case will be that clear-cut.

Another example:

x = "My name is James Herbert Bond"

Here x has one extra word, "Herbert", that is not present in y, so I would expect a matching method to give a somewhat lower score (say 90%).

Can you be more specific on what your desired output is? "hoping to get a fairly high match" is not very descriptive of what you want to do. — enumaris, Jul 22 '19 at 20:55
If you search in your browser for "Python plagiarism detection", you'll find references that can explain this much better than we can manage here. — Prune, Jul 22 '19 at 20:56
Please follow the posting guidelines in the help documentation, as suggested when you created this account. [On topic](https://stackoverflow.com/help/on-topic), [how to ask](https://stackoverflow.com/help/how-to-ask), and ... [the perfect question](https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) apply here. — Prune, Jul 22 '19 at 20:57
The question in the title is as simple as `x in y` which evaluates to `True`. But otherwise, you may need to look into something like Jellyfish, which has several different distance metrics, some of which can be weighted to give a higher match if the strings start the same way — G. Anderson, Jul 22 '19 at 21:05
@enumaris, @Prune: a "fairly high match" is what I am looking for. I don't know whether x in y should give a 90% match or a 70% match. I haven't tried G. Anderson's solution yet, but Sunitha's suggestion is something I'd be keen to implement. — Pankaj Singh, Jul 23 '19 at 14:49

score 0 · Answer 1 · answered Jul 22 '19 at 21:19

I would suggest looking into the partial_ratio method in the fuzzywuzzy module.

>>> x = 'My name is James Bond'
>>> y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'
>>> 
>>> from fuzzywuzzy import fuzz
>>> fuzz.partial_ratio(x, y)
100
>>> 
>>> x = "My name is James Herbert Bond"
>>> fuzz.partial_ratio(x, y)
72
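
As a rough intuition for why the first pair scores 100, the idea behind a partial ratio can be approximated with the standard library alone: align the shorter string against same-length windows of the longer one and keep the best `ratio()`. The sketch below is an assumption about the general approach, not fuzzywuzzy's actual code, and `partial_ratio_sketch` is a hypothetical name.

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1, s2):
    """Approximate a partial ratio: best score of the shorter string
    against equally long windows of the longer string (0-100)."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0
    best = 0.0
    # Each matching block suggests where the shorter string might line up.
    for block in SequenceMatcher(None, shorter, longer).get_matching_blocks():
        start = max(block.b - block.a, 0)
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)

x = 'My name is James Bond'
y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'
print(partial_ratio_sketch(x, y))                                # 100
print(partial_ratio_sketch('My name is James Herbert Bond', y))  # below 100
```

Because only a window of y the size of x is compared, the trailing text in y no longer lowers the score.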

Sunitha

11,777
2
20
23

That is what I was looking for. – Pankaj Singh Jul 23 '19 at 14:46

score 0 · Answer 2 · answered Jul 27 '21 at 06:02

Sum the lengths of the non-overlapping matching blocks and divide by the length of the first sequence.

from difflib import SequenceMatcher
x = 'My name is James Bond'
y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'
ratio = sum(block.size for block in SequenceMatcher(None, x, y).get_matching_blocks()) / len(x)
print(ratio)

This prints 1.0.
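
The same idea can be wrapped in a small helper and applied to the second example from the question. This is only a sketch: `containment` is a hypothetical name, and comparing word lists instead of characters is an assumption about what "one extra word" should cost.

```python
from difflib import SequenceMatcher

def containment(needle, haystack):
    """Fraction of `needle` covered by blocks that also appear, in order, in `haystack`."""
    if not needle:
        return 0.0
    matcher = SequenceMatcher(None, needle, haystack, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks()) / len(needle)

y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'
x1 = 'My name is James Bond'
x2 = 'My name is James Herbert Bond'

print(containment(x1, y))                     # 1.0, x1 is fully contained in y
print(containment(x2, y))                     # character level
print(containment(x2.split(), y.split()))     # word level: 5 of 6 words match
```

At word level the extra "Herbert" costs exactly one of six tokens, which is closer to the "say 90%" intuition than the character-level score.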
