I am trying to determine the similarity of two columns in a pandas dataframe:
Text1 | All
Performance results achieved by the approaches submitted to this Challenge. | The six top approaches and three others outperform the strong baseline.
Accuracy is one of the basic principles of perfectionist. | Where am I?
I would like to compare 'Performance results ... ' with 'The six...', and 'Accuracy is one...' with 'Where am I?'.
The first row should have a higher similarity score, since the two columns share some words; the second row should score 0, since the two columns have no words in common.
To compare the two columns I've used SequenceMatcher as follows:
from difflib import SequenceMatcher
ratio = SequenceMatcher(None, df.Text1, df.All).ratio()
but passing df.Text1 and df.All to it this way seems to be wrong.
Can you tell me why?
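For reference, here is a minimal sketch of the row-by-row comparison described above. SequenceMatcher compares two sequences item by item, so handing it whole columns would compare cells as items rather than the text inside each row. The helper name word_ratio, the lowercasing, and the whitespace tokenization are assumptions made for illustration; punctuation stays attached to words, so 'Challenge.' and 'Challenge' would not match.

```python
from difflib import SequenceMatcher

# The two rows from the question, as (Text1, All) pairs.
rows = [
    ("Performance results achieved by the approaches submitted to this Challenge.",
     "The six top approaches and three others outperform the strong baseline."),
    ("Accuracy is one of the basic principles of perfectionist.",
     "Where am I?"),
]


def word_ratio(a: str, b: str) -> float:
    """Hypothetical helper: similarity of two strings compared word by word."""
    # Splitting into lowercase words makes SequenceMatcher match whole words
    # instead of single characters.
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


# One score per row, instead of one score for the two columns as a whole.
ratios = [word_ratio(text1, all_text) for text1, all_text in rows]
print(ratios)
```

On a dataframe, the same helper could be applied per row with something like df['ratio'] = [word_ratio(a, b) for a, b in zip(df['Text1'], df['All'])]. With this tokenization the first row shares 'the' and 'approaches' and scores above 0, while the second row shares no words and scores 0.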