
I'm trying to create my own corpus for sentiment analysis of tweets (whether they are positive or negative).

To start, I'm experimenting with the existing NLTK movie-review corpus. However, when I run this code:

import string
from itertools import chain

from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)

I'm getting this output:

0.31
Most Informative Features
               uplifting = True              pos : neg    =      5.9 : 1.0
               wednesday = True              pos : neg    =      3.7 : 1.0
             controversy = True              pos : neg    =      3.4 : 1.0
                  shocks = True              pos : neg    =      3.0 : 1.0
                  catchy = True              pos : neg    =      2.6 : 1.0

instead of the expected output (see Classification using movie review corpus in NLTK/Python):

0.655
Most Informative Features
                     bad = True              neg : pos    =      2.0 : 1.0
                  script = True              neg : pos    =      1.5 : 1.0
                   world = True              pos : neg    =      1.5 : 1.0
                 nothing = True              neg : pos    =      1.5 : 1.0
                     bad = False             pos : neg    =      1.5 : 1.0

I'm using exactly the same code as on the other StackOverflow page, my NLTK installation (and theirs) is up to date, and I also have the most recent movie-reviews corpus. Does anyone have an idea what's going wrong?

Thanks!

1 Answer


My guess is that the line below is making the difference:

word_features = word_features.keys()[:100]

word_features is a dict (more precisely, a Counter), and its keys() method returns the keys in an arbitrary order, so the list of features in your training set is different from the list of features in the original post.

https://docs.python.org/2/library/stdtypes.html#dict.items
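
To make the feature selection match the original post's intent (the 100 most frequent words), you can use most_common() instead of slicing keys(). A minimal sketch, reusing the documents list from the question and assuming NLTK 3, where FreqDist is a subclass of collections.Counter:

from itertools import chain
from nltk.probability import FreqDist

# `documents` is the list of (token_list, label) pairs built in the question.
fd = FreqDist(chain(*[tokens for tokens, tag in documents]))

# Slicing keys() picks an essentially arbitrary 100 words, because a
# Counter's key order is not tied to frequency:
arbitrary_features = list(fd.keys())[:100]

# most_common(100) returns the 100 highest-frequency (word, count) pairs,
# which is deterministic up to ties in frequency:
word_features = [word for word, count in fd.most_common(100)]

With a frequency-ranked feature list, the training and test features stay consistent between runs, which is presumably what the 0.655 figure in the linked question depends on.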

  • I don't think that's the problem, because every time I run this code on different computers, I always get the same results (accuracy 0.31 and the same most informative features). – mvh Apr 24 '15 at 15:23
  • The order of keys() is arbitrary but not random, and some variations are implementation-specific. If I run the code on a Linux box, I can get different results than running the same code on a Windows box. – valentin Apr 24 '15 at 15:37
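
If you also want the selection to be reproducible across platforms (the Linux vs. Windows difference mentioned in the comments above), a further option is to sort the frequency distribution explicitly and break frequency ties alphabetically. A sketch under the same assumptions as above (documents comes from the question's code; fd is just a hypothetical name for the FreqDist built from it):

from itertools import chain
from nltk.probability import FreqDist

fd = FreqDist(chain(*[tokens for tokens, tag in documents]))

# Sort by descending count, then by word, so that frequency ties are
# resolved the same way on every platform and interpreter.
word_features = [word for word, count in
                 sorted(fd.items(), key=lambda wc: (-wc[1], wc[0]))[:100]]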