Okay, so I trained a Naive Bayes movie review classifier. However, when I run it against a negative review (copied from a website and pasted into a txt file), I get 'pos'. Am I doing something wrong? Here is the code:

import nltk, random
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set)) 
classifier.show_most_informative_features(5)
>>>0.67
>>>Most Informative Features
      contains(thematic) = True              pos : neg    =      8.9 : 1.0
        contains(annual) = True              pos : neg    =      8.9 : 1.0
       contains(miscast) = True              neg : pos    =      8.7 : 1.0
      contains(supports) = True              pos : neg    =      6.9 : 1.0
    contains(unbearable) = True              neg : pos    =      6.7 : 1.0

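# The 'rU' mode is deprecated (removed in Python 3.11); default text mode reads the file the same way
with open('negative_review.txt') as f:
    fraw = f.read()
review_tokens = nltk.word_tokenize(fraw)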
docfts = document_features(review_tokens)

classifier.classify(docfts)
>>>    'pos'

UPDATE: After re-running the program several times, it now classifies my negative review as negative. Can someone help me understand why? Or is this plain sorcery?

1 Answer

Classifiers are not 100% accurate, so a single review is a weak test. A better test is to see how the classifier behaves across many movie reviews. Your reported accuracy is 67%, meaning roughly one in three reviews will be misclassified. You can try improving the model by using a different classifier or different features (for example n-grams or word2vec embeddings).
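
On "different features": one possible contributor to the low accuracy and the run-to-run changes, which this thread does not confirm, is the line `word_features = list(all_words)[:2000]`. In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so iterating it yields words in dictionary order, not by frequency; `most_common(2000)` is the call that returns the most frequent words. A second, smaller assumption is that tokens from `nltk.word_tokenize` keep their original capitalization, so they may miss lowercase feature words. A minimal sketch of both ideas follows. It uses a plain `Counter` as a stand-in for `nltk.FreqDist` so it runs without the corpus; `select_word_features` and the sample sentence are made-up names for illustration.

```python
from collections import Counter

def select_word_features(tokens, n=2000):
    """Return the n most frequent lowercased tokens."""
    counts = Counter(t.lower() for t in tokens)
    # most_common(n) sorts by count; list(counts)[:n] would not
    return [word for word, _ in counts.most_common(n)]

def document_features(tokens, word_features):
    """Same 'contains(word)' features as in the question, with tokens lowercased."""
    present = {t.lower() for t in tokens}
    return {'contains({})'.format(w): (w in present) for w in word_features}

# Hypothetical toy data standing in for movie_reviews.words()
sample = "The plot was bad the acting was bad the ending was worse".split()
word_features = select_word_features(sample, n=3)

# 'Bad' matches the lowercase feature 'bad' because tokens are lowercased first
features = document_features("Bad acting".split(), word_features)
print(features)
```

With the real corpus, the equivalent change would be `word_features = [w for w, _ in all_words.most_common(2000)]`, where `all_words` is the `FreqDist` from the question.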

megadarkfriend
  • the assignment asked to use only the NaiveBayes Classifier :/ – Kimberly James Mar 01 '17 at 06:30
  • There's nothing wrong with your code, you just have to improve your features. Is there a certain accuracy threshold you have to hit? – megadarkfriend Mar 01 '17 at 06:34
  • nah...Actually what is weird is after re-running a few times...it actually classifies my negative review as negative! This is so weird...I would take a screenshot of this running and post that under my assignment! Also the accuracy rose on its own to 0.7! Is this sorcery? – Kimberly James Mar 01 '17 at 07:15
  • *That* could be a problem. Some variation is to be expected since you `random.shuffle` the training vs. testing data, but if your accuracy has been consistently moving upwards, something is wrong. – alexis Mar 01 '17 at 09:14
  • I think you should run your model through a cross validator. Alexis is correct: your accuracy shouldn't consistently move up. K-fold cross validation will give you a more reliable estimate of your model's accuracy than a single 100-review test split. – megadarkfriend Mar 01 '17 at 09:37
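
The two suggestions in the comments (expect variation from `random.shuffle`, and use k-fold cross validation) can be sketched roughly as below. This is a minimal illustration, not the commenters' code: `k_fold_accuracy`, the fixed `seed`, and the toy majority-label "classifier" are all assumptions made so the example runs without NLTK.

```python
import random

def k_fold_accuracy(labeled, k, train, accuracy, seed=0):
    """Average held-out accuracy over k folds of (features, label) pairs."""
    data = list(labeled)
    random.Random(seed).shuffle(data)  # fixed seed: reruns use the same split
    scores = []
    for i in range(k):
        test_part = data[i::k]  # every k-th item, starting at index i
        train_part = [x for j, x in enumerate(data) if j % k != i]
        model = train(train_part)
        scores.append(accuracy(model, test_part))
    return sum(scores) / k

# Hypothetical stand-in classifier: always predicts the majority training label
def train_majority(rows):
    labels = [label for _, label in rows]
    return max(set(labels), key=labels.count)

def majority_accuracy(model, rows):
    return sum(label == model for _, label in rows) / len(rows)

# Toy data: 30 'pos' and 10 'neg' examples
toy = [({'x': i}, 'pos' if i % 4 else 'neg') for i in range(40)]
score = k_fold_accuracy(toy, 5, train_majority, majority_accuracy)
print(score)
```

With the code from the question, the call would presumably look like `k_fold_accuracy(featuresets, 10, nltk.NaiveBayesClassifier.train, nltk.classify.accuracy)`. Averaging over folds smooths out the luck of any one split, and the fixed seed makes a change in accuracy between runs attributable to the code rather than to the shuffle.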