I have a dataset as follows, only with more rows:
import pandas as pd
data = {'First': ['First value', 'Third value', 'Second value', 'First value', 'Third value', 'Second value'],
        'Second': ['the old man is here', 'the young girl is there', 'the old woman is here', 'the young boy is there', 'the young girl is here', 'the old girl is here']}
df = pd.DataFrame(data, columns=['First', 'Second'])
I have calculated the average fuzzywuzzy similarity for the entire dataset like this:
from itertools import combinations
from fuzzywuzzy import fuzz

def similarity_measure(doc1, doc2):
    return fuzz.token_set_ratio(doc1, doc2)

# Join all sentences of each category into one string per category
d = df.groupby('First')['Second'].apply(', '.join).reset_index()

# Score every pair of categories and average the scores
scores = []
for i, j in combinations(range(len(d)), 2):
    scores.append(similarity_measure(d.iloc[i, 1], d.iloc[j, 1]))
avg = sum(scores) / len(scores)
print('lexical overlap between all example pairs in the dataset is: ', avg)
However, I would also like to get this average separately for each category in the first column. So I would like something like this (for example):
similarity average for sentences in First value: 85.56
similarity average for sentences in Second value: 89.01
similarity average for sentences in Third value: 90.01
So I would like to modify the for loop so that it produces the above output.
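
To make the goal concrete, here is a rough sketch of the kind of per-category loop I have in mind. It is only an illustration under assumptions: `average_similarity_per_group` and `fallback_scorer` are names I made up, the scorer is a standard-library stand-in (it is not equivalent to `fuzz.token_set_ratio`), and the `groups` dict is just the example data above regrouped by hand.

```python
from itertools import combinations
from difflib import SequenceMatcher


def fallback_scorer(a, b):
    # Stand-in scorer (assumption): stdlib similarity ratio scaled to 0-100.
    # It does NOT give the same numbers as fuzz.token_set_ratio.
    return 100 * SequenceMatcher(None, a, b).ratio()


def average_similarity_per_group(groups, scorer=fallback_scorer):
    """Return {category: mean pairwise score} over each category's sentences."""
    averages = {}
    for category, sentences in groups.items():
        # Score every unordered pair of sentences inside this category only
        scores = [scorer(a, b) for a, b in combinations(sentences, 2)]
        # A category with a single sentence has no pairs, so no average
        averages[category] = sum(scores) / len(scores) if scores else None
    return averages


# The example rows from above, grouped by the 'First' column
groups = {
    'First value': ['the old man is here', 'the young boy is there'],
    'Second value': ['the old woman is here', 'the old girl is here'],
    'Third value': ['the young girl is there', 'the young girl is here'],
}

for category, avg in average_similarity_per_group(groups).items():
    print(f'similarity average for sentences in {category}: {avg}')
```

With the DataFrame above, I believe `df.groupby('First')['Second'].apply(list).to_dict()` would build the same kind of mapping, and `fuzz.token_set_ratio` could be passed as `scorer` to get fuzzywuzzy scores instead of the stand-in.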