
I recently started learning machine learning and followed a tutorial that builds a model to classify a tweet as positive or negative. The program worked reasonably well, but it wasn't very accurate and I didn't want to tackle the Twitter API yet, so I adapted the program to predict whether a movie review is positive or negative. I figured that would be easier, and once I got it working I could try Twitter.

However, now that I finally have it up and running, I always get 'positive' as the outcome. My training data is a set of 400 positive and 400 negative movie reviews.

Here is where I got my dataset (the first link under 'Sentiment polarity datasets', called 'polarity dataset v2.0 (3.0Mb)'): http://www.cs.cornell.edu/people/pabo/movie-review-data/

I did not use all 2000 reviews, only the first 400 from positive and 400 from negative.

import nltk
import glob
import errno


path = r"C:\Users\Thomas\tweets\pos\*.txt"
files = glob.glob(path)

pos_rev = []
neg_rev = []
for name in files:
    try:
        with open(name) as f:
            content = f.read()
            pos_rev.append((content, 'positive'))

    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise
for name in files:
    try:
        with open(name) as f:
            content = f.read()
            neg_rev.append((content, 'negative'))
    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise
# create array to store all reviews
reviews = []

# separate reviews into individual words, dropping words of 2 characters or fewer
# create training set (reviews)
for (words, sentiment) in pos_rev + neg_rev:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    reviews.append((words_filtered, sentiment))




def get_words_in_reviews(reviews):
    all_words = []
    for (words, sentiment) in reviews:
        all_words.extend(words)
    return all_words
def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

word_features = get_word_features(get_words_in_reviews(reviews))

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

training_set = nltk.classify.apply_features(extract_features, reviews)
classifier = nltk.NaiveBayesClassifier.train(training_set)

review = 'That movie was very bad.  Poor directing, terrible acting and horrible production.'
print(classifier.classify(extract_features(review.split())))
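One detail in the code above is that the training reviews are lowercased and filtered by length, while the test review is only passed through `review.split()`. A minimal sketch of a shared tokenizer follows; the `tokenize` helper is hypothetical, and stripping punctuation is an extra assumption that the original training code does not make.

```python
import string


def tokenize(text, min_len=3):
    """Lowercase, strip surrounding punctuation, drop tokens shorter than min_len.

    Hypothetical helper; the punctuation stripping is an assumption and is
    not part of the original training code.
    """
    tokens = []
    for raw in text.split():
        word = raw.lower().strip(string.punctuation)
        if len(word) >= min_len:
            tokens.append(word)
    return tokens


review = 'That movie was very bad.  Poor directing, terrible acting and horrible production.'

# Plain split() keeps capitalisation and trailing punctuation ('That', 'bad.').
print(review.split()[:5])

# The shared tokenizer yields the lowercase form used for the training features.
print(tokenize(review)[:5])
```

Using the same function for both training and prediction would mean that a feature such as `contains(bad)` can actually match the word in the test review.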

No matter what I put into the classifier, it always comes back positive.
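One thing that may be worth checking: both loops in the code above iterate over the same `files` list, which was built from the `pos` folder, so every review ends up stored once as 'positive' and once as 'negative'. Below is a minimal sketch of loading each class from its own folder and confirming that the two sets do not overlap. The `load_reviews` helper, the `neg` folder name and the throwaway sample files are all assumptions for illustration.

```python
import glob
import os
import tempfile


def load_reviews(pattern, label):
    """Read every file matching `pattern` and pair its text with `label`."""
    labeled = []
    for name in sorted(glob.glob(pattern)):
        if os.path.isdir(name):
            continue  # skip directories up front instead of catching EISDIR
        with open(name, encoding="utf-8") as f:
            labeled.append((f.read(), label))
    return labeled


# Build a tiny throwaway corpus so the sketch runs anywhere (made-up texts).
root = tempfile.mkdtemp()
samples = {"pos": ["a great film"], "neg": ["a dull film", "a terrible plot"]}
for folder, texts in samples.items():
    os.makedirs(os.path.join(root, folder))
    for i, text in enumerate(texts):
        path = os.path.join(root, folder, "review%03d.txt" % i)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)

# Each class gets its own glob pattern.
pos_rev = load_reviews(os.path.join(root, "pos", "*.txt"), "positive")
neg_rev = load_reviews(os.path.join(root, "neg", "*.txt"), "negative")

# Sanity check: no review text should appear under both labels.
overlap = {text for text, _ in pos_rev} & {text for text, _ in neg_rev}
print(len(pos_rev), len(neg_rev), len(overlap))
```

If the same check were run on the lists built by the original two loops, the overlap would contain every review, which would leave the classifier with no evidence to separate the two labels.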

Also, if there's anyone still reading down here, what exactly does this do:

except IOError as exc:
    if exc.errno != errno.EISDIR:
        raise

I know it is there in case there is an error when trying to open the files, but is there anything significant I should know about IOError, .errno, the != and raise? Or is this just the standard except block when reading in files?
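For reference, here is a small sketch of how that pattern behaves, assuming Python 3 on Linux or macOS. The `read_text_or_none` helper is hypothetical. On Windows, opening a directory typically fails with a permission error instead of `EISDIR`, so the first call would re-raise there.

```python
import errno
import os
import tempfile


def read_text_or_none(path):
    """Return the file's text, or None if `path` turns out to be a directory."""
    try:
        with open(path) as f:
            return f.read()
    except IOError as exc:  # in Python 3, IOError is another name for OSError
        # exc.errno holds the operating system's numeric error code.
        if exc.errno != errno.EISDIR:
            raise  # not "is a directory": re-raise the same exception unchanged
        return None


folder = tempfile.mkdtemp()
missing = os.path.join(folder, "missing.txt")

# On Linux and macOS, opening a directory fails with EISDIR, which is swallowed.
print(read_text_or_none(folder))

# A missing file fails with ENOENT instead, so the bare `raise` lets it propagate.
try:
    read_text_or_none(missing)
except FileNotFoundError as exc:
    print(exc.errno == errno.ENOENT)
```

In other words, the block silently ignores one specific failure (the path is a directory) and passes every other I/O error on to the caller.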

Thanks in advance for any help!

Tom D