I have a dataset with two columns, one for users and one for texts:
`User` `Text`
49 there is a cat under the table
21 the sun is hot
431 could you please close the window?
65 there is a cat under the table
21 the sun is hot
53 there is a cat under the table
My expected output would be:
Text Freq
there is a cat under the table 3
the sun is hot 2
could you please close the window? 1
My approach is to use `fuzz.partial_ratio` to measure the similarity between sentences, and then use `groupby` to calculate the frequency.
`fuzz.partial_ratio` returns a score from 0 to 100, so an exact match returns 100:
check_match = df.apply(lambda row: fuzz.partial_ratio(row['Text'], row['Text']) >= value, axis=1)
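where `value` is the threshold used to decide whether two texts match. As written, both arguments are the same row's `Text`, so every row scores 100 against itself; what I actually need is a comparison between different rows.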
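One way to compare texts across rows, instead of each row with itself, is a greedy grouping pass: each text is compared with one representative per existing group, and either joins the first group that meets the threshold or starts a new group. The sketch below is only an illustration under some assumptions. The names `fuzzy_frequencies` and `ratio_0_100` are hypothetical, and the scorer is a standard-library stand-in (`difflib.SequenceMatcher`), not `fuzz.partial_ratio`, so that the example runs without extra packages.

```python
from difflib import SequenceMatcher


def ratio_0_100(a, b):
    """Stand-in similarity on a 0-100 scale (stdlib difflib, NOT fuzz.partial_ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def fuzzy_frequencies(texts, threshold=90, scorer=ratio_0_100):
    """Greedily group similar texts and count how many fall into each group.

    Hypothetical helper: `threshold` plays the role of `value` in the question.
    """
    groups = []  # each entry is [representative_text, count]
    for text in texts:
        for group in groups:
            # Compare against the group's representative, not against itself
            if scorer(text, group[0]) >= threshold:
                group[1] += 1
                break
        else:
            # No existing group was similar enough: start a new one
            groups.append([text, 1])
    # Most frequent first
    return sorted(((t, c) for t, c in groups), key=lambda g: g[1], reverse=True)


texts = [
    "there is a cat under the table",
    "the sun is hot",
    "could you please close the window?",
    "there is a cat under the table",
    "the sun is hot",
    "there is a cat under the table",
]

for text, freq in fuzzy_frequencies(texts):
    print(text, freq)
```

With a DataFrame, the same function could presumably be called as `fuzzy_frequencies(df['Text'], threshold=value, scorer=fuzz.partial_ratio)`, since `fuzz.partial_ratio(a, b)` also returns a 0-100 score. Two caveats: `partial_ratio` gives a high score when one string is a substring of the other, so short texts may be merged into longer ones, and the greedy pass depends on row order because each text is only compared with the first member of each group.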