Compare two text columns to measure their similarity in a dataframe in Python

Asked May 03 '22 at 01:18

Active May 03 '22 at 03:53

Viewed 304 times

I want to compare column A with column C, and column B with column C, measure the similarity of each pair, and then report whichever of A or B is more similar to C.

import pandas as pd

df = pd.DataFrame([['JAMES LIKEN', 'LINDEN R. EVANS', 'LINDEN R. EVANS'], ['HENRY THEISEN', 'SCOTT ULLEM', 'Henry J. Theisen']])
df.columns = ['A', 'B', 'C']

The result should have three columns. The first two should contain the similarity ratios (A vs. C and B vs. C), and the third should contain either column A or column B, whichever is more similar to C. I tried fuzz.partial_ratio and SequenceMatcher, using apply with a lambda to call the function on each row, but it raised an error.

edited May 03 '22 at 03:53

monk

asked May 03 '22 at 01:18

mehdi samimi

Can you add the code you've tried and an example of the wanted result, that'd greatly help. – monk May 03 '22 at 01:26
I tried different codes, but none of them worked. Here are some examples: df['sim'] = df.apply(lambda x: fuzz.partial_ratio(x['A'], x['C']), axis=1) df['sim'] = df[['A', 'C']].apply(lambda x: SequenceMatcher(lambda y: y == " ", x[0], x[1]).ratio(), axis=1) – mehdi samimi May 03 '22 at 01:31
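
For reference, here is a minimal sketch of one way the row-wise comparison described in the question could be written, using only difflib.SequenceMatcher from the standard library together with pandas. The helper name similarity, the output column names sim_A_C, sim_B_C and best_match, the case-insensitive comparison, and the choice to store the winning value (rather than the column label) are all assumptions made for illustration; the question does not specify them.

```python
from difflib import SequenceMatcher

import pandas as pd

df = pd.DataFrame(
    [
        ["JAMES LIKEN", "LINDEN R. EVANS", "LINDEN R. EVANS"],
        ["HENRY THEISEN", "SCOTT ULLEM", "Henry J. Theisen"],
    ],
    columns=["A", "B", "C"],
)


def similarity(left, right):
    """Return a ratio between 0 and 1; lowercasing both sides is an assumption."""
    return SequenceMatcher(None, str(left).lower(), str(right).lower()).ratio()


# One similarity column per pair, computed row by row.
df["sim_A_C"] = df.apply(lambda row: similarity(row["A"], row["C"]), axis=1)
df["sim_B_C"] = df.apply(lambda row: similarity(row["B"], row["C"]), axis=1)

# Keep the value from A when A is at least as similar to C as B is, else take B.
# Ties therefore go to A, which is an arbitrary choice.
df["best_match"] = df["A"].where(df["sim_A_C"] >= df["sim_B_C"], df["B"])

print(df[["sim_A_C", "sim_B_C", "best_match"]])
```

If a fuzzy-matching library is preferred, its scoring function could presumably be swapped in for similarity, keeping in mind that its scale may differ from the 0 to 1 ratio used here.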

0 Answers