Pythonic way to count the number of times words from a list / set occur in a dataframe column

Question

Given a list / set of labels

labels = {'rectangle', 'square', 'triangle', 'cube'}

and a dataframe df,

df = pd.DataFrame(['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], columns=['text'])

I want to count how many times each word in my set of labels occurs in the text column of the dataframe, and create a new column containing the top X (maybe 2 or 3) most repeated words. If two words are repeated equally often, they can appear together in a list or string.

Output:

pd.DataFrame({'text' : ['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], 'best_labels' : [{'rectangle' : 2, 'square' : 1, 'cube' : 1}, {'triangle' : 1, 'circle' : 1}, np.nan]})

df['best_labels'] = some_function(df.text)

Is `pd.DataFrame({'text' : ['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], 'best_labels' : [{'rectangle' : 2, 'square' : 1, 'cube' : 1}, {'triangle' : 1, 'circle' : 1}, np.nan]})` something you have, or is that part of the expected output? — Red, Jun 28 '20 at 21:28
Why not just leave an empty set in `best_labels` for the case where nothing matches? `np.nan` ("not a number") is a strange "default" value to use here, since *none of the valid values are numbers either*. — Karl Knechtel, Jun 28 '20 at 21:41

score 5 · Accepted Answer · answered Jun 28 '20 at 21:29

from collections import Counter

import numpy as np
import pandas as pd

labels = {'rectangle', 'square', 'triangle', 'cube'}
df = pd.DataFrame(['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], columns=['text'])

# Count every word in the row and keep only the labels; an empty dict falls back to NaN.
df['best_labels'] = df.text.apply(lambda x: {k: v for k, v in Counter(x.split()).items() if k in labels} or np.nan)
print(df)

Prints:

                                    text                               best_labels
0  rectangle rectangle in my square cube  {'rectangle': 2, 'square': 1, 'cube': 1}
1               triangle circle not here                           {'triangle': 1}
2                           nothing here                                       NaN
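
The snippet above keeps every matching label rather than only the top X. A minimal sketch of one way to trim the result, assuming "top X" means the X highest distinct counts with tied labels kept together (the question leaves this open). `top_labels` and `top_n` are hypothetical names used for illustration, not part of the original answer:

```python
from collections import Counter

import numpy as np
import pandas as pd

labels = {'rectangle', 'square', 'triangle', 'cube'}
df = pd.DataFrame(
    ['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'],
    columns=['text'],
)


def top_labels(text, labels, top_n=2):
    """Return {label: count} for the top_n highest distinct counts, keeping ties.

    Hypothetical helper; the tie-handling rule is an assumption.
    """
    # Count only the words that are in the label set.
    counts = Counter(word for word in text.split() if word in labels)
    if not counts:
        return np.nan
    # Distinct count values, highest first; the last one kept is the cutoff.
    cutoff = sorted(set(counts.values()), reverse=True)[:top_n][-1]
    # most_common() yields labels from most to least frequent.
    return {label: n for label, n in counts.most_common() if n >= cutoff}


df['best_labels'] = df['text'].apply(lambda text: top_labels(text, labels, top_n=2))
print(df)
```

With `top_n=1` the first row reduces to `{'rectangle': 2}`; with `top_n=2` it also keeps `square` and `cube`, which tie at a count of 1.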

cs95 · Answer 2 · 2020-06-28T22:04:21.690

Another way to visualize your data is with a matrix:

(df['text'].str.extractall(r'\b({})\b'.format('|'.join(labels)))
           .groupby(level=0)[0]
           .value_counts()
           .unstack()
           .reindex(df.index)
           .rename_axis(None, axis=1))

   cube  rectangle  square  triangle
0   1.0        2.0     1.0       NaN
1   NaN        NaN     NaN       1.0
2   NaN        NaN     NaN       NaN

The idea is to extract every word in each row that appears in labels, then count how many times each one occurs per row.

What does this look like? Yup, you guessed it, a sparse matrix.
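
To make the sparse-matrix reading concrete, here is a hedged sketch that fills the missing counts with 0 and converts the result to pandas' sparse dtype. Sorting and escaping the labels with `re.escape`, the `fill_value=0` arguments, and the names `pattern`, `counts` and `sparse_counts` are additions for illustration, not part of the original answer:

```python
import re

import pandas as pd

labels = {'rectangle', 'square', 'triangle', 'cube'}
df = pd.DataFrame(
    ['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'],
    columns=['text'],
)

# Sort for a deterministic pattern; escape in case a label contains regex metacharacters.
pattern = r'\b({})\b'.format('|'.join(re.escape(label) for label in sorted(labels)))

counts = (df['text'].str.extractall(pattern)
            .groupby(level=0)[0]
            .value_counts()
            .unstack(fill_value=0)            # labels absent from a row count as 0
            .reindex(df.index, fill_value=0)  # restore rows that had no match at all
            .rename_axis(None, axis=1))

# Store only the non-zero entries.
sparse_counts = counts.astype(pd.SparseDtype('int64', 0))

print(counts)
print(sparse_counts.sparse.density)  # fraction of entries that are non-zero
```

Using 0 instead of NaN keeps the counts as integers, which is usually more convenient if the matrix is fed to later numeric code.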
