0

I am trying to count how many times each word from a list of words of interest occurs in a dict column.

First I import my data:

products = graphlab.SFrame('amazon_baby.gl/')
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
products.head(5)

Data can be found here: https://drive.google.com/open?id=0BzbhZp-qIglxM3VSVWRsVFRhTWc

I then create a list of the words I am interested in:

words = ['awesome', 'great', 'fantastic']

I would like to count the number of times each word in "words" occurs in products['word_count'].

I am not married to using graphlab. It was just suggested to me by a colleague.

papayawarrior
Trexion Kameha
Welcome to SO. We'd like to see evidence of your effort to complete your code. As is, it looks like you have the bare structure and don't know how to complete it, which isn't what SO is for. Please read "[ask]" including the links and "[mcve]". I'd also recommend reading http://meta.stackoverflow.com/q/261592/128421. – the Tin Man Jun 06 '16 at 21:52

5 Answers

1

I am not quite sure what you mean by 'in a dict column'. If it is a list of texts:

import operator

dictionary = {'texts': ['red blue blue', 'red black', 'blue white white', 'red', 'white', 'black', 'blue red']}
words = ['red', 'white', 'blue']
freqs = dict()
for t in dictionary['texts']:
    for w in words:
        # dict.get supplies 0 the first time a word is seen (replaces the bare try/except)
        freqs[w] = freqs.get(w, 0) + t.count(w)
# Sort the (word, count) pairs by count, highest first
top_words = sorted(freqs.items(), key=operator.itemgetter(1), reverse=True)

If it is just one text:

import operator

dictionary = {'text': 'red blue blue red black blue white white red white black blue red'}
words = ['red', 'white', 'blue']
# Each word is visited once, so its count can be assigned directly
freqs = {w: dictionary['text'].count(w) for w in words}
# Sort the (word, count) pairs by count, highest first
top_words = sorted(freqs.items(), key=operator.itemgetter(1), reverse=True)
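
One caveat with counting via str.count on a whole string: it matches substrings, so 'red' is also found inside 'bored'. A minimal sketch of whole-word counting follows, assuming that lowercasing and splitting on runs of letters and apostrophes is an acceptable tokenization. The helper name count_whole_words and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def count_whole_words(text, words):
    """Count whole-word, case-insensitive occurrences of each word in text."""
    # Assumption: a "word" is a run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never appear.
    return {w: counts[w] for w in words}

text = "A bored reader wrote: red, Red and blue."
whole_word_counts = count_whole_words(text, ['red', 'white', 'blue'])
print(whole_word_counts)  # {'red': 2, 'white': 0, 'blue': 1}
```

Here "bored" does not contribute to the count for 'red', while "red" and "Red" both do.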
hipoglucido
1

If you want to count occurrences of words, a fast way to do it is to use a Counter object from collections.

For example :

In [3]: from collections import Counter
In [4]: c = Counter(['hello', 'world'])

In [5]: c
Out[5]: Counter({'hello': 1, 'world': 1})

Could you show the output of your products.head(5) command?
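
Applied to a column of per-review dictionaries, the Counter idea might look like the sketch below. The word_counts list is a made-up stand-in for products['word_count'] (one dict per review), since the real data is not shown here.

```python
from collections import Counter

# Hypothetical stand-in for products['word_count']: one dict per review.
word_counts = [
    {'great': 2, 'product': 1},
    {'awesome': 1, 'great': 1, 'stroller': 1},
    {'bad': 1},
]
words = ['awesome', 'great', 'fantastic']

totals = Counter()
for row in word_counts:
    # Counter.update with a mapping adds its counts to the running totals.
    totals.update(row)

# Keep only the words of interest; missing words come back as 0.
result = {w: totals[w] for w in words}
print(result)  # {'awesome': 1, 'great': 3, 'fantastic': 0}
```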

arthur
1

If you stick with graphlab (or SFrame), use the SArray.dict_trim_by_keys method. The documentation is here: https://dato.com/products/create/docs/generated/graphlab.SArray.dict_trim_by_keys.html

import graphlab as gl
sf = gl.SFrame({'review': ['what a good book', 'terrible book']})
sf['word_bag'] = gl.text_analytics.count_words(sf['review'])

keywords = ['good', 'book']
sf['key_words'] = sf['word_bag'].dict_trim_by_keys(keywords, exclude=False)
print sf

+------------------+---------------------+---------------------+
|      review      |       word_bag      |      key_words      |
+------------------+---------------------+---------------------+
| what a good book | {'a': 1, 'good':... | {'good': 1, 'boo... |
|  terrible book   | {'book': 1, 'ter... |     {'book': 1}     |
+------------------+---------------------+---------------------+ 
[2 rows x 3 columns]
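
If you end up not using graphlab, the same trimming step can be approximated in plain Python. This is only a sketch of the idea, not the graphlab implementation: trim_by_keys is a hypothetical helper, and word_bags is a hand-written copy of the two rows above.

```python
def trim_by_keys(word_bag, keys):
    """Return a copy of word_bag containing only the given keys."""
    keep = set(keys)
    return {k: v for k, v in word_bag.items() if k in keep}

# Hand-written copy of the word_bag column from the example above.
word_bags = [
    {'what': 1, 'a': 1, 'good': 1, 'book': 1},
    {'terrible': 1, 'book': 1},
]
keywords = ['good', 'book']

trimmed = [trim_by_keys(bag, keywords) for bag in word_bags]
# Sum each keyword over all rows to get one total per keyword.
totals = {k: sum(bag.get(k, 0) for bag in trimmed) for k in keywords}
print(trimmed)  # [{'good': 1, 'book': 1}, {'book': 1}]
print(totals)   # {'good': 1, 'book': 2}
```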
papayawarrior
0

Do you want to put each of the counts in a separate column? In that case this may work:

keywords = ['keyword1', 'keyword2']

def word_counter(dict_cell, word):
    # Return the count stored for `word`, or 0 if the word is absent
    if word in dict_cell:
        return dict_cell[word]
    else:
        return 0

# Add one column per keyword
for word in keywords:
    df[word] = df['word_count'].apply(lambda x: word_counter(x, word))
Leila S
0
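def count_words(x, w):
    # str.count already returns 0 when w is absent, so no membership test is needed.
    # Note: this counts substring matches in the raw review text.
    return x.count(w)

selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

# Add one column per selected word
for word in selected_words:
    products[word] = products['review'].apply(lambda x: count_words(x, word))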
Suraj Rao