I have recently begun trying to get into Machine Learning and was following a tutorial that created a model that could determine if an inputted tweet was positive or negative. The program worked decently well, but it wasn't very accurate and I didn't want to tackle working with the Twitter API yet, so I tried to convert the program to predict a movie review's stance (positive/negative). I figured it would be easier and once I got it working I could try Twitter.
However, now that I finally have it up and running, I always get 'positive' as the outcome. My training data is a set of 400 positive and 400 negative movie reviews.
Here is where I got my dataset from: http://www.cs.cornell.edu/people/pabo/movie-review-data/ (the exact link is the first one under 'Sentiment polarity datasets', called 'polarity dataset v2.0' (3.0Mb)).
I did not use all 2000 reviews, only the first 400 from positive and 400 from negative.
import nltk
import glob
import errno
path = r"C:\Users\Thomas\tweets\pos\*.txt"
files = glob.glob(path)
pos_rev = []
neg_rev = []
for name in files:
    try:
        with open(name) as f:
            content = f.read()
            pos_rev.append((content, 'positive'))
    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise
for name in files:
    try:
        with open(name) as f:
            content = f.read()
            neg_rev.append((content, 'negative'))
    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise
# create array to store all reviews
reviews = []
# separate reviews into individual words, dropping words shorter than 3 characters
# create training set (reviews)
for (words, sentiment) in pos_rev + neg_rev:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    reviews.append((words_filtered, sentiment))
def get_words_in_reviews(reviews):
    all_words = []
    for (words, sentiment) in reviews:
        all_words.extend(words)
    return all_words
def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features
word_features = get_word_features(get_words_in_reviews(reviews))
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
training_set = nltk.classify.apply_features(extract_features, reviews)
classifier = nltk.NaiveBayesClassifier.train(training_set)
review = 'That movie was very bad. Poor directing, terrible acting and horrible production.'
print(classifier.classify(extract_features(review.split())))
No matter what I put into the classifier, it always comes back positive.
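One thing that may be worth checking is how the test sentence is tokenized compared with the training data. The training loop lowercases each word and keeps only words of 3 or more characters, while the test call passes `review.split()` directly, so capitalized words and words with trailing punctuation (such as `bad.`) may not match any training feature. The sketch below illustrates this with plain Python only; `tokenize_like_training` and `tokenize_stripped` are hypothetical helper names, and stripping punctuation is an assumption about what would help, not something the tutorial does.

```python
import string

def tokenize_like_training(text):
    # Same filtering as the training loop: lowercase, keep tokens of 3+ characters.
    return [t.lower() for t in text.split() if len(t) >= 3]

def tokenize_stripped(text):
    # Hypothetical variant: also strip leading/trailing punctuation from each token.
    tokens = (t.strip(string.punctuation).lower() for t in text.split())
    return [t for t in tokens if len(t) >= 3]

review = ('That movie was very bad. Poor directing, terrible acting '
          'and horrible production.')

raw_tokens = set(review.split())              # what the test call currently passes
training_style = set(tokenize_like_training(review))
stripped = set(tokenize_stripped(review))

print('bad' in raw_tokens)       # False: the token is 'bad.' with the period attached
print('poor' in raw_tokens)      # False: the token is 'Poor' with a capital letter
print('poor' in training_style)  # True once lowercased
print('bad' in stripped)         # True once punctuation is stripped
```

If this is part of the problem, tokenizing the test review the same way as the training reviews should at least make the `contains(...)` features comparable. It would not by itself explain why the result is always 'positive', so it is only one of the things to rule out.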
Also, if there's anyone still reading down here, what exactly does this do:
    except IOError as exc:
        if exc.errno != errno.EISDIR:
            raise
I know it is there in case there is an error when trying to open the files, but is there anything significant I should know about IOError, .errno, the != and raise? Or is this just the standard except block when reading in files?
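As far as I can tell, the pattern means "ignore the error if the path turned out to be a directory, and re-raise anything else". Here is a minimal sketch of that reading, assuming Python 3 (where `IOError` is an alias of `OSError`) and a POSIX system (on Windows, opening a directory may fail with a different error code). The function name `read_text_or_skip_dirs` is made up for illustration.

```python
import errno
import os
import tempfile

def read_text_or_skip_dirs(path):
    """Return the file's text, or None if the path is a directory."""
    try:
        with open(path) as f:
            return f.read()
    except IOError as exc:
        # exc.errno holds the OS error code. EISDIR means "is a directory".
        # Any other code (missing file, no permission, ...) is re-raised
        # unchanged by the bare `raise`.
        if exc.errno != errno.EISDIR:
            raise
        return None

# Small demonstration in a temporary directory.
tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, 'review.txt')
with open(file_path, 'w') as f:
    f.write('a fine film')

print(read_text_or_skip_dirs(file_path))  # the file's contents
print(read_text_or_skip_dirs(tmp_dir))    # None on POSIX: the directory is skipped
```

Since `glob` with a `*.txt` pattern is unlikely to return directories, the block may rarely do anything in my script, but I would still like to know whether it is considered standard practice.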
Thanks in advance for any help!