Graphlab: How to avoid manually duplicating functions that differ only in a string variable?

Question

I imported my dataset with SFrame:

products = graphlab.SFrame('amazon_baby.gl')
products['word_count'] = graphlab.text_analytics.count_words(products['review'])

I would like to do sentiment analysis on a set of words shown below:

selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

Then I would like to create a new column in products for each of the selected words, where each entry is the number of times that word occurs in the review. So I created a function for the word "awesome":

def awesome_count(word_count):
    # word_count is the dict of word -> count for a single review.
    if 'awesome' in word_count:
        return word_count['awesome']
    else:
        return 0

products['awesome'] = products['word_count'].apply(awesome_count)

So far so good, but I would need to manually create a similar function for each of the other selected words, e.g., great_count. How can I avoid this manual effort and write cleaner code?

papayawarrior · Answer 1 · 2016-01-31T07:05:21.957

I think the SFrame.unpack command should do the trick. In fact, the limit parameter will accept your list of selected words and keep only these results, so that part is greatly simplified.

I don't know precisely what's in your reviews data, so I made a toy example:

# Create the data and convert to bag-of-words.
import graphlab
products = graphlab.SFrame({'review':['this book is awesome',
                                      'I hate this book']})

products['word_count'] = \
    graphlab.text_analytics.count_words(products['review'])

# Unpack the bag-of-words into separate columns.
selected_words = ['awesome', 'hate']
products2 = products.unpack('word_count', limit=selected_words)


# Fill in zeros for the missing values.
for word in selected_words:
    col_name = 'word_count.{}'.format(word)
    products2[col_name] = products2[col_name].fillna(value=0)
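
To make the intended result concrete, here is a plain-Python sketch (no GraphLab involved) of what the unpack-and-fill steps above are meant to produce for the toy data. The helper name unpack_counts is hypothetical, and lowercasing plus whitespace splitting is only an assumption standing in for count_words.

```python
from collections import Counter


def unpack_counts(reviews, selected_words, prefix='word_count'):
    """Hypothetical helper: build one 'prefix.word' column per selected word.

    Words that do not occur in a review get 0, which mirrors the
    fillna(value=0) step.
    """
    columns = {'{}.{}'.format(prefix, w): [] for w in selected_words}
    for review in reviews:
        # Assumption: simple lowercase + whitespace tokenization.
        counts = Counter(review.lower().split())
        for w in selected_words:
            columns['{}.{}'.format(prefix, w)].append(counts.get(w, 0))
    return columns


reviews = ['this book is awesome', 'I hate this book']
result = unpack_counts(reviews, ['awesome', 'hate'])
print(result)
```

Each key corresponds to one of the word_count.<word> column names used in the loop above, and each list holds one count per review.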

I also can't help but point out that GraphLab Create does have its own sentiment analysis toolkit, which could be worth checking out.

edited Jan 31 '16 at 07:05

answered Jan 31 '16 at 06:57

papayawarrior


Thank you for the help. I spent a bit of time and found an easier way with apply and lambda. – drdot Jan 31 '16 at 18:59
Sorry, I don't quite understand. Are you looking for an answer that uses `apply` instead of `unpack`? – papayawarrior Jan 31 '16 at 19:55
I feel that using apply looks cleaner than using the "unpack", "format" and "fillna" methods. Feel free to throw in different opinions. – drdot Jan 31 '16 at 21:09
It's worth timing. My hunch is that `unpack` will be much faster than using `apply` for each word separately. – papayawarrior Jan 31 '16 at 21:47

score 0 · Accepted Answer · answered Jan 31 '16 at 18:58

I actually found an easier way to do this:

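# Return how often selectedWord occurs in one review's word-count dict (0 if absent).
def wordCount_select(wc, selectedWord):
    if selectedWord in wc:
        return wc[selectedWord]
    else:
        return 0

# Add one column per selected word.
for word in selected_words:
    products[word] = products['word_count'].apply(lambda wc: wordCount_select(wc, word))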
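
A possible variation on the same idea, sketched here on plain Python dicts rather than an SFrame: a small factory function returns a counter for one word, and dict.get replaces the if/else. The name make_counter and the sample word_counts list are hypothetical and only serve the illustration. Passing the word as a function argument also fixes it for each counter, which matters in plain Python when a lambda created in a loop is called only after the loop has moved on. Whether that ever affects SFrame's apply is not established here.

```python
def make_counter(word):
    """Hypothetical helper: return a function that looks up `word` in a word-count dict."""
    def count(wc):
        # dict.get returns 0 when the word is missing from this review.
        return wc.get(word, 0)
    return count


selected_words = ['awesome', 'hate']

# Stand-in for the word_count column: one dict per review.
word_counts = [
    {'this': 1, 'book': 1, 'is': 1, 'awesome': 1},
    {'i': 1, 'hate': 1, 'this': 1, 'book': 1},
]

# One counter per word, then one list of counts per word.
counters = {word: make_counter(word) for word in selected_words}
columns = {word: [fn(wc) for wc in word_counts] for word, fn in counters.items()}
print(columns)
```

With an SFrame, the same function could presumably be used as products['word_count'].apply(make_counter(word)) inside the loop, though I have not run that against GraphLab.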
