
I am trying to use a Naive Bayes classifier from the sklearn module to classify whether movie reviews are positive. I am using a bag of words as the features for each review and a large dataset with sentiment scores attached to reviews.

import pandas as pd

df_bows = pd.DataFrame.from_records(bag_of_words)
df_bows = df_bows.fillna(0).astype(int)

This code creates a pandas dataframe which looks like this:

   The  Rock  is  destined  to  ...  Staggeringly  ’  ve  muttering  dissing
0    1     1   1         1   2  ...             0  0   0          0        0
1    2     0   1         0   0  ...             0  0   0          0        0
2    0     0   0         0   0  ...             0  0   0          0        0
3    0     0   1         0   4  ...             0  0   0          0        0
4    0     0   0         0   0  ...             0  0   0          0        0

I then try to fit a model on this dataframe against the sentiment of each review, using this code:

from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb = nb.fit(df_bows, movies.sentiment > 0)

However, I get an error which says:

AttributeError: 'Series' object has no attribute 'to_coo'

This is what the movies dataframe looks like:

    sentiment                                               text
id                                                              
1    2.266667  The Rock is destined to be the 21st Century's ...
2    3.533333  The gorgeously elaborate continuation of ''The...
3   -0.600000                     Effective but too tepid biopic
4    1.466667  If you sometimes like to go to the movies to h...
5    1.733333  Emerges as something rare, an issue movie that...

Can you help with this?

  • assuming movies is a dataframe with the column sentiment, this should work, could you show what movies.sentiment looks like? – Ezer K Jul 21 '20 at 19:42
  • @EzerK I have edited the question to include the movies dataframe – Luke Turvey Jul 22 '20 at 07:18
  • seems like `MultinomialNB` has an issue with the `df`, did you try to pass the values instead? e.g. `nb.fit(df_bows.values, movies.sentiment > 0)` – FObersteiner Jul 22 '20 at 07:26
  • another idea is to run only on the first couple of lines to see where the error occurs – Ezer K Jul 22 '20 at 08:59

1 Answer


When you fit your MultinomialNB model, sklearn's input validation checks whether the input df_bows is sparse. If it decides that it is, as in this case, it calls to_coo on it, so the dataframe has to be converted to the 'Sparse' dtype. Here is how I fixed it:

import pandas as pd
from sklearn.naive_bayes import MultinomialNB

df_bows = pd.DataFrame.from_records(bag_of_words)

# Keep NaN values and convert to Sparse type
sparse_bows = df_bows.astype('Sparse')

nb = MultinomialNB()
nb = nb.fit(sparse_bows, movies['sentiment'] > 0)

Link to the pandas docs: pandas.Series.sparse.to_coo
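For anyone who wants to try this without the original dataset, here is a minimal sketch of the two workarounds mentioned in this thread: passing a plain array (as suggested in the comments, here via `to_numpy()` rather than `.values`) and converting every column to a sparse dtype. The toy `bag_of_words` and `sentiment` values are made up for illustration, and the explicit `fill_value=0` is my own assumption rather than something from the question.

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for bag_of_words and movies.sentiment.
bag_of_words = [
    {"great": 2, "fun": 1},
    {"dull": 1, "slow": 2},
    {"great": 1, "moving": 1},
    {"dull": 2, "boring": 1},
]
sentiment = pd.Series([2.3, -0.6, 1.5, -1.2])
labels = sentiment > 0

# Words missing from a review become NaN; replace them with a count of 0.
df_bows = pd.DataFrame.from_records(bag_of_words).fillna(0).astype(int)

# Workaround 1 (from the comments): pass a plain NumPy array, not a DataFrame.
nb_dense = MultinomialNB().fit(df_bows.to_numpy(), labels)
dense_pred = nb_dense.predict(df_bows.to_numpy())

# Workaround 2 (the idea in this answer): give every column a sparse dtype.
# Assumption: an explicit fill value of 0, so absent words stay zero counts.
sparse_bows = df_bows.astype(pd.SparseDtype("int64", fill_value=0))
nb_sparse = MultinomialNB().fit(sparse_bows, labels)
sparse_pred = nb_sparse.predict(sparse_bows)

print(dense_pred)
print(sparse_pred)
```

Both models see the same counts, so they should agree on the training rows. For a large vocabulary the sparse variant may save memory, while the dense one is simpler to debug.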

César R.