
Given a list / set of labels

labels = {'rectangle', 'square', 'triangle', 'cube'}

and a dataframe df,

df = pd.DataFrame(['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], columns=['text'])

I want to count how many times each word in my set of labels occurs in the text column of the dataframe, and create a new column holding the top X (maybe 2 or 3) most repeated words. If two words are repeated equally often, they can both appear, in a list or a string.

Output:

pd.DataFrame({'text' : ['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], 'best_labels' : [{'rectangle' : 2, 'square' : 1, 'cube' : 1}, {'triangle' : 1, 'circle' : 1}, np.nan]})

df['best_labels'] = some_function(df.text)
sammywemmy
v_coder12
  • Is `pd.DataFrame({'text' : ['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], 'best_labels' : [{'rectangle' : 2, 'square' : 1, 'cube' : 1}, {'triangle' : 1, 'circle' : 1}, np.nan]})` something you have, or is that part of the expected output? – Red Jun 28 '20 at 21:28
  • Why not just leave an empty set in `best_labels` for the case where nothing matches? `np.nan` ("not a number") is a strange "default" value to use here, since *none of the valid values are numbers either*. – Karl Knechtel Jun 28 '20 at 21:41

2 Answers

from collections import Counter

import numpy as np
import pandas as pd

labels = {'rectangle', 'square', 'triangle', 'cube'}
df = pd.DataFrame(['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], columns=['text'])

# Count every word in each row and keep only the labels; an empty dict is falsy, so it falls back to NaN
df['best_labels'] = df.text.apply(lambda x: {k: v for k, v in Counter(x.split()).items() if k in labels} or np.nan)
print(df)

Prints:

                                    text                               best_labels
0  rectangle rectangle in my square cube  {'rectangle': 2, 'square': 1, 'cube': 1}
1               triangle circle not here                           {'triangle': 1}
2                           nothing here                                       NaN
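
The question also asks for only the top X labels per row, with ties kept. A minimal sketch of one way to do that, building on the same Counter idea: `top_labels` and its parameter `n` are hypothetical names introduced here, and the rule "keep every label whose count ties with the n-th place" is an assumption about how ties should be handled.

```python
from collections import Counter

import numpy as np
import pandas as pd

labels = {'rectangle', 'square', 'triangle', 'cube'}
df = pd.DataFrame(
    ['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'],
    columns=['text'],
)


def top_labels(text, n=2):
    """Hypothetical helper: keep the n most frequent labels, plus any ties at the cutoff."""
    counts = Counter(word for word in text.split() if word in labels)
    if not counts:
        return np.nan  # no label found in this row
    ranked = counts.most_common()  # (word, count) pairs, highest count first
    cutoff = ranked[min(n, len(ranked)) - 1][1]  # count held by the n-th place
    return {word: count for word, count in ranked if count >= cutoff}


# Extra keyword arguments to Series.apply are forwarded to the function
df['best_labels'] = df['text'].apply(top_labels, n=2)
print(df)
```

With `n=2` the first row keeps all three labels because `square` and `cube` tie for second place; with `n=1` only `rectangle` would remain.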
Andrej Kesely

Another way to visualize your data is with a matrix:

(df['text'].str.extractall(r'\b({})\b'.format('|'.join(labels)))
           .groupby(level=0)[0]
           .value_counts()
           .unstack()
           .reindex(df.index)
           .rename_axis(None, axis=1))

   cube  rectangle  square  triangle
0   1.0        2.0     1.0       NaN
1   NaN        NaN     NaN       1.0
2   NaN        NaN     NaN       NaN

The idea is to extract from each row every word that appears in labels, then count how many times each one occurs in that row.

What does this look like? Yup, you guessed it, a sparse matrix.
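
If the sparse-matrix view is what you want, here is a rough sketch of taking it one step further. Assumptions on my part, not in the code above: the labels are passed through `re.escape` and sorted before joining, missing counts are filled with 0, and the names `pattern`, `counts`, `dense` and `sparse` are just illustrative.

```python
import re

import pandas as pd

labels = {'rectangle', 'square', 'triangle', 'cube'}
df = pd.DataFrame(
    ['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'],
    columns=['text'],
)

# Escape each label so regex metacharacters in a label cannot break the pattern
pattern = r'\b({})\b'.format('|'.join(map(re.escape, sorted(labels))))

counts = (df['text'].str.extractall(pattern)
                    .groupby(level=0)[0]
                    .value_counts()
                    .unstack()
                    .reindex(df.index)
                    .rename_axis(None, axis=1))

# A missing label means zero occurrences, so NaN can become 0 and the counts integers
dense = counts.fillna(0).astype(int)

# Store only the non-zero entries using pandas' sparse dtype
sparse = dense.astype(pd.SparseDtype(int, fill_value=0))
print(sparse.sparse.density)  # fraction of cells that are actually stored
```

For three rows this saves nothing, but with many labels and many rows most cells are zero, which is where the sparse dtype starts to pay off.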

cs95