Comparing strings within two columns in pandas with SequenceMatcher

Question

I am trying to determine the similarity of two columns in a pandas dataframe:

Text1                                                                             All
Performance results achieved by the approaches submitted to this Challenge.       The six top approaches and three others outperform the strong baseline.
Accuracy is one of the basic principles of perfectionist.                             Where am I?

I would like to compare 'Performance results ... ' with 'The six...' and 'Accuracy is one...' with 'Where am I?'. The first row should have a higher degree of similarity between the two columns because they share some words; the second should be equal to 0, as the two columns have no words in common.

To compare the two columns I've used SequenceMatcher as follows:

from difflib import SequenceMatcher

ratio = SequenceMatcher(None, df.Text1, df.All).ratio()

but the use of df.Text1, df.All seems to be wrong.

Can you tell me why?

Trenton McKinney · Accepted Answer · 2020-08-12T18:59:27.650

SequenceMatcher isn't designed for a pandas Series; it expects the two sequences to compare, such as two strings.
You could .apply the function to each row instead.
SequenceMatcher Examples
- With isjunk=None, nothing is treated as junk, not even spaces.
- With isjunk=lambda y: y == " ", spaces are treated as junk.
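To illustrate why passing whole columns does not give a per-row score, here is a minimal sketch that uses plain Python lists as stand-ins for the two columns (an assumption, so that it runs without pandas). SequenceMatcher treats each list element as a single item, so whole sentences are tested for equality instead of being compared character by character.

```python
from difflib import SequenceMatcher

# Plain lists standing in for df.Text1 and df.All (an assumption made so
# the example runs without pandas).
col_a = [
    "Performance results achieved by the approaches submitted to this Challenge.",
    "Accuracy is one of the basic principles of perfectionist.",
]
col_b = [
    "The six top approaches and three others outperform the strong baseline.",
    "Where am I?",
]

# Whole columns: every element (a full sentence) counts as one item, and no
# sentence in col_a equals a sentence in col_b, so nothing matches.
column_ratio = SequenceMatcher(None, col_a, col_b).ratio()

# Row by row: each pair of strings is compared character by character.
row_ratios = [SequenceMatcher(None, a, b).ratio() for a, b in zip(col_a, col_b)]

print(column_ratio)
print(row_ratios)
```

The single column-level number says nothing about individual rows, which is why the comparison has to be made once per row.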

from difflib import SequenceMatcher
import pandas as pd

data = {'Text1': ['Performance results achieved by the approaches submitted to this Challenge.', 'Accuracy is one of the basic principles of perfectionist.'],
        'All': ['The six top approaches and three others outperform the strong baseline.', 'Where am I?']}

df = pd.DataFrame(data)

# isjunk=lambda y: y == " "
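# access the row by column label; positional x[0] on a labelled row is deprecated in recent pandas
df['ratio'] = df[['Text1', 'All']].apply(lambda x: SequenceMatcher(lambda y: y == " ", x['Text1'], x['All']).ratio(), axis=1)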

# display(df)
                                                                         Text1                                                                      All     ratio
0  Performance results achieved by the approaches submitted to this Challenge.  The six top approaches and three others outperform the strong baseline.  0.356164
1                    Accuracy is one of the basic principles of perfectionist.                                                              Where am I?  0.088235

# isjunk=None
df['ratio'] = df[['Text1', 'All']].apply(lambda x: SequenceMatcher(None, x['Text1'], x['All']).ratio(), axis=1)

# display(df)
                                                                         Text1                                                                      All     ratio
0  Performance results achieved by the approaches submitted to this Challenge.  The six top approaches and three others outperform the strong baseline.  0.410959
1                    Accuracy is one of the basic principles of perfectionist.                                                              Where am I?  0.117647
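Note that the ratios above are character-based, which is why the second row is not 0 even though the two sentences share no words. If a word-level score is closer to what the question asks for, one possible sketch is to split each string into words before comparing, since SequenceMatcher accepts any sequences of hashable items. The helper name word_ratio and the lowercase \w+ tokenization are assumptions for illustration, not part of the original answer.

```python
import re
from difflib import SequenceMatcher


def word_ratio(a: str, b: str) -> float:
    """Similarity of two strings compared word by word (hypothetical helper)."""
    # Lowercase and keep only runs of word characters, so punctuation and
    # capitalisation do not prevent two words from matching.
    words_a = re.findall(r"\w+", a.lower())
    words_b = re.findall(r"\w+", b.lower())
    # The matcher now compares lists of words instead of characters.
    return SequenceMatcher(None, words_a, words_b).ratio()


pairs = [
    ("Performance results achieved by the approaches submitted to this Challenge.",
     "The six top approaches and three others outperform the strong baseline."),
    ("Accuracy is one of the basic principles of perfectionist.",
     "Where am I?"),
]

word_ratios = [word_ratio(a, b) for a, b in pairs]
print(word_ratios)
```

With the DataFrame from the answer, this could be applied as df['word_ratio'] = [word_ratio(a, b) for a, b in zip(df['Text1'], df['All'])]. The second row then scores 0.0 because the two sentences have no words in common, while the first row stays above 0.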

Thank you so much, your solution works perfectly and saved me 2 hours — Trần Quốc Hoài new 2015, Sep 02 '21 at 17:16
