How to count how many sentences are similar?

Question

I have a dataset with two columns, one for users and one for texts:

`User`        `Text`
49        there is a cat under the table
21        the sun is hot
431       could you please close the window?
65        there is a cat under the table
21        the sun is hot
53        there is a cat under the table

My expected output would be:

Text                                   Freq         
there is a cat under the table          3
the sun is hot                          2
could you please close the window?      1

My approach is to use fuzz.partial_ratio to determine the match (similarity) between all sentences and then groupby to calculate the frequency.

I am using fuzz.partial_ratio so that an exact match returns the maximum score (100):

check_match = df.apply(lambda row: fuzz.partial_ratio(row['Text'], row['Text']) >= value, axis=1)

where value is the threshold used to decide whether two sentences count as a match.
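The thresholded grouping described here can be sketched as follows. Note that comparing row['Text'] with itself always gives the maximum score, so each sentence has to be compared against the other sentences instead. This is only a minimal sketch under stated assumptions: it swaps in the standard library's difflib.SequenceMatcher for fuzz.partial_ratio (its ratio() runs from 0 to 1 rather than 0 to 100), and the count_similar helper, the greedy "compare against the first member of each group" strategy, and the 0.9 threshold are all illustrative choices, not part of the original code.

```python
from difflib import SequenceMatcher


def count_similar(texts, threshold=0.9):
    """Greedily group texts by similarity to each group's first member.

    Hypothetical helper: returns {representative text: group size}.
    """
    counts = {}
    for text in texts:
        for rep in counts:
            # ratio() is 1.0 for identical strings, lower for weaker matches
            if SequenceMatcher(None, rep, text).ratio() >= threshold:
                counts[rep] += 1
                break
        else:
            # No existing group is similar enough: start a new one
            counts[text] = 1
    return counts


texts = [
    "there is a cat under the table",
    "the sun is hot",
    "could you please close the window?",
    "there is a cat under the table",
    "the sun is hot",
    "there is a cat under the table",
]
result = count_similar(texts)
for text, freq in result.items():
    print(f"{text:<40}{freq}")
```

With a threshold of 1.0 this reduces to exact counting; lowering it lets near-duplicates fall into the same group, at the cost of comparing every sentence against every group representative.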

Where is your code for that approach? What doesn't work about it? Stack Overflow is not a coding service; you have to make an honest attempt, and *then* ask a *specific* question about your algorithm or technique. — Prune, Sep 11 '20 at 16:44
you say you intend to use `fuzz.partial_ratio` but in your example you have exactly matching values — anky, Sep 11 '20 at 16:47
Yes, anky. I was considering fuzz.partial_ratio in case I have strings that do not match exactly. If they match, I should get 1, but if they do not, I could set a threshold to group them — , Sep 11 '20 at 16:56
@Prune, I am going to update the question. Sorry about that. Stack Overflow requires a minimal reproducible example, but I could not provide one because this is extracted from more complex code. — , Sep 11 '20 at 16:57
@LucaDiMauro I think you should also edit the example. `fuzzywuzzy` is not justified yet in the example; it is a `value_counts()` problem (*since the records match exactly in the example*). Edit the example to make clear when and why to use `fuzz.ratio`, and update the expected output accordingly. — anky, Sep 11 '20 at 17:13

score 1 · Accepted Answer · answered Sep 11 '20 at 16:46

You could use value_counts()

df['Text'].value_counts()
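
To get the Text/Freq layout from the question's expected output, the result of value_counts() can be reshaped into a two-column frame. A minimal sketch, assuming the sample data from the question; the freq name is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "User": [49, 21, 431, 65, 21, 53],
    "Text": [
        "there is a cat under the table",
        "the sun is hot",
        "could you please close the window?",
        "there is a cat under the table",
        "the sun is hot",
        "there is a cat under the table",
    ],
})

# value_counts() sorts by descending count; rename_axis/reset_index
# turn the resulting Series into a frame with Text and Freq columns
freq = df["Text"].value_counts().rename_axis("Text").reset_index(name="Freq")
print(freq)
```

This only counts exact matches, which is all the example data in the question requires.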

rhug123

score 0 · Answer 2 · answered Sep 11 '20 at 16:40

Try this:

df = df.groupby('Text').count()

gtomer

score 0 · Answer 3 · answered Sep 11 '20 at 16:44

The following should work:

import pandas as pd
from collections import Counter

# Count how often each sentence occurs, then build the Text/Freq frame
d = dict(Counter(df.Text))
new_df = pd.DataFrame({'Text': list(d.keys()), 'Freq': list(d.values())})
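
If the rows should come out ordered by frequency, as in the question's expected output, Counter.most_common() is one option. A minimal sketch on a plain list standing in for df.Text (the texts and rows names are illustrative); the resulting pairs could then be passed to pd.DataFrame(rows, columns=['Text', 'Freq']).

```python
from collections import Counter

texts = [
    "there is a cat under the table",
    "the sun is hot",
    "could you please close the window?",
    "there is a cat under the table",
    "the sun is hot",
    "there is a cat under the table",
]

# most_common() yields (text, count) pairs, highest count first
rows = Counter(texts).most_common()
for text, freq in rows:
    print(f"{text:<40}{freq}")
```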

IoaTzimas
