Difflib SequenceMatcher with sentences

Question

I have the following dataframe:

Column1         Column2
tomato fruit    tomatoes are not a fruit
potato la best  potatoe are some sort of fruit
apple           there are great benefits to appel
pear            peer

I would like to look up the word or phrase on the left in the sentence on the right. If there is a match on at most the first two words (e.g. 'potato la', leaving out 'best'), it should give a score.

I have already used two different methods:

from difflib import SequenceMatcher as SM

# Column1 and Column2 are the two dataframe columns
for i in range(len(Column1)):
    store_it = SM(None, Column1[i], Column2[i]).get_matching_blocks()
    print(store_it)

And

import difflib

df['diff'] = df.apply(lambda x: difflib.SequenceMatcher(None, x['Column1'].strip(), x['Column2'].strip()).ratio(), axis=1)

which I found on the internet.

The second one works fine, except that it tries to match the entire phrase. How can I match the words in the first column against the sentences in the second column, so that it ultimately gives me 'Yes' (they are in the sentence, at least partially) or 'No' (they aren't)?

score 1 · Answer 1 · answered Oct 04 '18 at 20:08

I had the best success using FuzzyWuzzy's partial ratio on this one. It gives a partial-match score (0 to 100) between Column1 ("tomato fruit") and Column2 ("tomatoes are not a fruit"), and so on down the columns:

from fuzzywuzzy import fuzz
import difflib

df['fuzz_partial_ratio'] = df.apply(lambda x: fuzz.partial_ratio(x['Column1'], x['Column2']), axis=1)

df['sequence_ratio'] = df.apply(lambda x: difflib.SequenceMatcher(None, x['Column1'], x['Column2']).ratio(), axis=1)

You can consider any FuzzyWuzzy score > 60 to be a good partial match, i.e. yes, the words in Column1 are most likely in the sentence in Column2.

Row 1: score 67, row 2: score 71, row 3: score 80, row 4: score 75
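If installing FuzzyWuzzy is not an option, a similar Yes/No answer can be sketched with the standard library alone. The following is only an illustration, not the method used above: the helper name first_words_match, the max_words parameter and the 0.7 cutoff are all assumptions chosen for this example. It keeps at most the first two words of Column1 and checks whether any of them has a close match among the words of Column2, using difflib.get_close_matches.

```python
import difflib

def first_words_match(phrase, sentence, max_words=2, cutoff=0.7):
    """Return 'Yes' if any of the first `max_words` words of `phrase`
    closely matches a word in `sentence`, otherwise 'No'.

    Hypothetical helper: the name, `max_words` and `cutoff` are
    assumptions made for this sketch.
    """
    keys = phrase.lower().split()[:max_words]   # e.g. 'potato la best' -> ['potato', 'la']
    words = sentence.lower().split()
    for key in keys:
        # get_close_matches compares `key` to each word with SequenceMatcher
        # and keeps those whose ratio is at least `cutoff`
        if difflib.get_close_matches(key, words, n=1, cutoff=cutoff):
            return "Yes"
    return "No"

# The four rows from the question
rows = [
    ("tomato fruit", "tomatoes are not a fruit"),
    ("potato la best", "potatoe are some sort of fruit"),
    ("apple", "there are great benefits to appel"),
    ("pear", "peer"),
]

for left, right in rows:
    print(left, "->", first_words_match(left, right))
```

On a dataframe the same function could be applied row by row, for example df.apply(lambda x: first_words_match(x['Column1'], x['Column2']), axis=1). The cutoff is a tuning knob: a lower value accepts more misspellings but also more false positives.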

score 0 · Answer 2 · answered Jun 03 '17 at 08:52

Use set():

From the Python documentation:
issubset(other)
set <= other
Test whether every element in the set is in other.

For instance:

c_set1 = set(Column1[i])  # set() on a string gives the set of its characters
c_set2 = set(Column2[i])
if c_set1.issubset(c_set2):
    # every element of c_set1 is also in c_set2
    print('Yes')
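Because set() applied to a string yields its characters, a word-level version of the same idea needs split() first. The sketch below is an assumption-laden illustration (the helper name words_in_sentence and the max_words parameter are made up for this example): it keeps at most the first two words of the left-hand phrase and tests whether all of them occur in the sentence.

```python
def words_in_sentence(phrase, sentence, max_words=2):
    """Return True if each of the first `max_words` words of `phrase`
    appears as a whole word in `sentence`.

    Hypothetical helper for illustration only.
    """
    keys = set(phrase.lower().split()[:max_words])
    words = set(sentence.lower().split())
    return keys.issubset(words)   # same test as keys <= words

print(words_in_sentence("pear", "a pear tree"))
print(words_in_sentence("tomato fruit", "tomatoes are not a fruit"))

# Character-level comparison, for contrast: every letter of 'apple'
# occurs somewhere in the sentence, so the subset test succeeds
print(set("apple") <= set("there are great benefits to appel"))
```

Note that the word-level test is exact: 'tomato' does not match 'tomatoes', so this approach does not cover the partial or misspelled matches asked about in the question.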
