
I have a dataset as follows, only with more rows:

import pandas as pd

data = {'First':  ['First value', 'Third value', 'Second value', 'First value', 'Third value', 'Second value'],
        'Second': ['the old man is here', 'the young girl is there', 'the old woman is here',
                   'the  young boy is there', 'the young girl is here', 'the old girl is here']}

df = pd.DataFrame(data, columns=['First', 'Second'])

I have calculated the fuzzywuzzy average for the entire dataset like this:

from itertools import combinations

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def similarity_measure(doc1, doc2):
    return fuzz.token_set_ratio(doc1, doc2)


# concatenate all sentences of each category into one document per category
d = df.groupby('First')['Second'].apply(lambda x: ', '.join(x))
d = d.reset_index()

# score every pair of category documents
scores = []
for val in list(combinations(range(len(d)), 2)):
    scores.append(similarity_measure(d.iloc[val[0], 1], d.iloc[val[1], 1]))

avg = sum(scores) / len(scores)
print('lexical overlap between all example pairs in the dataset is: ', avg)

However, I would also like to get this average for each category in the first column separately. So, I would like something like this (for example):

similarity average for sentences in First value: 85.56
similarity average for sentences in Second value: 89.01
similarity average for sentences in Third value: 90.01

So I would like to modify the for loop in a way that would give me the above output.

zara kolagar

1 Answer


To compute the mean within each group, you need two steps:

  1. Group by some criterion, in your case the column First. It seems you already know how to do this.
  2. Create a function that computes the similarity within a group (the all_similarity_measure function in the code below).

Code

import pandas as pd
from fuzzywuzzy import fuzz
from itertools import combinations


def similarity_measure(doc1, doc2):
    return fuzz.token_set_ratio(doc1, doc2)


data = {'First': ['First value', 'Third value', 'Second value', 'First value', 'Third value', 'Second value'],
        'Second': ['the old man is here', 'the young girl is there', 'the old woman is here', 'the  young boy is there',
                   'the young girl is here', 'the old girl is here']}

df = pd.DataFrame(data, columns=['First', 'Second'])


def all_similarity_measure(gdf):
    """This function computes the similarity between all pairs of sentences in a Series"""
    return pd.Series([similarity_measure(*docs) for docs in combinations(gdf, 2)]).mean()


res = df.groupby('First', as_index=False)['Second'].apply(all_similarity_measure)
print(res)

Output

          First  Second
0   First value    63.0
1  Second value    86.0
2   Third value    98.0

The key to computing the mean similarity is this expression:

return pd.Series([similarity_measure(*docs) for docs in combinations(gdf, 2)]).mean()

Basically, you generate the pairs of sentences using combinations (no need to access by index), construct a Series, and compute the mean on it.
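
For illustration, this is roughly what combinations yields for a group of three sentences (a made-up example, not data from the question):

from itertools import combinations

group = ['sentence one', 'sentence two', 'sentence three']
for pair in combinations(group, 2):
    print(pair)
# ('sentence one', 'sentence two')
# ('sentence one', 'sentence three')
# ('sentence two', 'sentence three')

Each tuple is unpacked into similarity_measure(*docs), so every distinct pair of sentences in the group is scored exactly once.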

Any function for computing the mean can be used instead of the above; for example, you could use statistics.mean to avoid constructing a Series.

from statistics import mean

def all_similarity_measure(gdf):
    """This function computes the similarity between all pairs of sentences in a Series"""
    return mean(similarity_measure(*docs) for docs in combinations(gdf, 2))
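
If you also want the output printed in the exact format from your question, a minimal sketch (just iterating over the res DataFrame produced above) could look like this:

for _, row in res.iterrows():
    # e.g. "similarity average for sentences in First value: 63.00"
    print(f"similarity average for sentences in {row['First']}: {row['Second']:.2f}")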
Dani Mesejo