
I have a dataset with two columns, one for users and one for texts:

User      Text
49        there is a cat under the table
21        the sun is hot
431       could you please close the window?
65        there is a cat under the table
21        the sun is hot
53        there is a cat under the table

My expected output would be:

Text                                   Freq         
there is a cat under the table          3
the sun is hot                          2
could you please close the window?      1

My approach is to use `fuzz.partial_ratio` to measure the similarity between all sentences, and then `groupby` to calculate the frequency of each group.

I am using `fuzz.partial_ratio` so that an exact match returns the maximum score of 100:

check_match = df.apply(lambda row: fuzz.partial_ratio(row['Text'], row['Text']) >= value, axis=1)

where `value` is the threshold used to decide whether two texts count as a match.

E_net4
  • Is this a pandas DataFrame? – Piero Costa Sep 11 '20 at 16:39
  • Yes, it is a pandas dataframe –  Sep 11 '20 at 16:41
  • 1
    Where is your code for that approach? What doesn't work about it? Stack Overflow is not a coding service; you have to make an honest attempt, and *then* ask a *specific* question about your algorithm or technique. – Prune Sep 11 '20 at 16:44
  • you say you intend to use `fuzz.partial_ratio` but in your example you have exactly matching values – anky Sep 11 '20 at 16:47
  • Yes, anky. I was considering `fuzz.partial_ratio` in case I have strings which do not match exactly. If they match, I should get 100; if they do not, I could set a threshold to group them –  Sep 11 '20 at 16:56
  • @Prune, I am going to update the question, sorry about that. Stack Overflow requires a minimal reproducible example, and I could not provide one because this is extracted from more complex code. –  Sep 11 '20 at 16:57
  • 2
    @LucaDiMauro I think you should also edit the example, `fuzzywuzzy` is not justified yet in the example, it is a `value_counts()` (*since the records are exactly matching in the example*) , edit the example to be clear when and why to use `fuzz.ratio` and the expected output accordingly. – anky Sep 11 '20 at 17:13

3 Answers

1

You could use value_counts()

df['Text'].value_counts()
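
To get the two-column `Text` / `Freq` layout from the question, the Series returned by `value_counts()` can be turned back into a DataFrame. This is a minimal sketch, assuming the sample data from the question; the variable name `freq` is only illustrative:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "User": [49, 21, 431, 65, 21, 53],
    "Text": [
        "there is a cat under the table",
        "the sun is hot",
        "could you please close the window?",
        "there is a cat under the table",
        "the sun is hot",
        "there is a cat under the table",
    ],
})

# value_counts() returns a Series indexed by the distinct texts, sorted by
# descending count; rename_axis/reset_index turn it into two named columns.
freq = df["Text"].value_counts().rename_axis("Text").reset_index(name="Freq")
print(freq)
```

This only counts exact duplicates, which is all the example in the question needs.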
rhug123
0

Try this:

df = df.groupby('Text').count()
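
Note that `count()` keeps the original column name, so the counts end up in a column called `User`. A possible variant, sketched here on the sample data from the question, uses `size()` and names the result column `Freq` (the name `freq` for the result is an assumption for illustration):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "User": [49, 21, 431, 65, 21, 53],
    "Text": [
        "there is a cat under the table",
        "the sun is hot",
        "could you please close the window?",
        "there is a cat under the table",
        "the sun is hot",
        "there is a cat under the table",
    ],
})

# size() counts the rows in each group; reset_index(name=...) names that column.
freq = (
    df.groupby("Text")
      .size()
      .reset_index(name="Freq")
      .sort_values("Freq", ascending=False)  # most frequent text first
      .reset_index(drop=True)
)
print(freq)
```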
gtomer
0

The following should work:

import pandas as pd
from collections import Counter

d = dict(Counter(df.Text))  # map each distinct text to its number of occurrences
new_df = pd.DataFrame({'Text': list(d.keys()), 'Freq': list(d.values())})
IoaTzimas
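
None of the answers above cover the case raised in the comments, where texts are similar but not identical. One way this could be approached is to assign each text to the first earlier text it resembles, then count the groups. The sketch below is an assumption-laden illustration, not the asker's method: it swaps in `difflib.SequenceMatcher` from the standard library instead of `fuzz.partial_ratio` so that it runs without extra installs, and the helpers `similarity` and `group_similar`, the threshold of 90, and the misspelled row are all hypothetical.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (stand-in for a fuzzywuzzy score)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def group_similar(texts, threshold=90):
    """Greedily label each text with the first representative it resembles."""
    representatives = []  # one text per group, in order of first appearance
    labels = []
    for text in texts:
        for rep in representatives:
            if similarity(text, rep) >= threshold:
                labels.append(rep)
                break
        else:
            # No existing group is close enough: this text starts a new one.
            representatives.append(text)
            labels.append(text)
    return labels


df = pd.DataFrame({
    "User": [49, 21, 431, 65, 21, 53],
    "Text": [
        "there is a cat under the table",
        "the sun is hot",
        "could you please close the window?",
        "there is a cat under the tabel",  # hypothetical typo, not in the question
        "the sun is hot",
        "there is a cat under the table",
    ],
})

df["Group"] = group_similar(df["Text"], threshold=90)
freq = df["Group"].value_counts().rename_axis("Text").reset_index(name="Freq")
print(freq)
```

The greedy pass compares every text against every group representative, so it is quadratic in the worst case, and the grouping can depend on row order. The threshold would need tuning on real data.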