
I'm trying to train a classifier for tweets. However, the classifier reports 100% accuracy, and the list of the most informative features doesn't display anything. Does anyone know what I'm doing wrong? I believe all my inputs to the classifier are correct, so I have no idea where it is going wrong.

This is the dataset I'm using: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip

This is my code:

import nltk
import random

file = open('Train/train.txt', 'r')


documents = []
all_words = []           #TODO remove punctuation?
INPUT_TWEETS = 3000

print("Preprocessing...")
for line in (file):

    # Tokenize Tweet content
    tweet_words = nltk.word_tokenize(line[2:])

    sentiment = ""
    if line[0] == 0:
        sentiment = "negative"
    else:
        sentiment = "positive"
    documents.append((tweet_words, sentiment))

    for word in tweet_words:
        all_words.append(word.lower())

    INPUT_TWEETS = INPUT_TWEETS - 1
    if INPUT_TWEETS == 0:
        break

random.shuffle(documents) 


all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]   #top 3000 words

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

#Categorize as positive or Negative
feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]


training_set = feature_set[:1000]
testing_set = feature_set[1000:]  

print("Training...")
classifier = nltk.NaiveBayesClassifier.train(training_set)

print("Naive Bayes Accuracy:", (nltk.classify.accuracy(classifier,testing_set))*100)
classifier.show_most_informative_features(15)
Daniel Medina Sada
    Looks like the problem is in comparing the character at `line[0]` with the `int` `0`. I doubt your input actually uses null bytes to indicate negative sentiment. – alexis Apr 04 '17 at 23:35
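A quick way to see what that comment means (the tab-separated line below is made-up sample data, not from the real dataset): `line[0]` is a one-character string, so comparing it to the integer `0` is always `False`, and every tweet gets labeled "positive".

```python
# line[0] is the string "0" or "1", never the int 0, so the original
# comparison never matches; compare against the string "0" instead.
line = "0\tdid not enjoy this at all"

assert (line[0] == 0) is False    # str vs int: always False
assert line[0] == "0"             # this is the comparison that was intended

sentiment = "negative" if line[0] == "0" else "positive"
```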

1 Answer


There is a typo in your code:

feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]

Because the loop variable is misspelled as `sentment`, the `sentiment` inside the comprehension resolves to the stale outer variable (whatever value it held after the last tweet of your preprocessing loop). Every example therefore gets the same label, so training is pointless and no feature is informative.
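Here is a minimal reproduction of the effect with toy stand-in data (not the real dataset), showing the buggy comprehension next to the corrected one:

```python
# Toy stand-ins for the variables in the question.
word_features = ["love", "hate"]

def find_features(document):
    words = set(document)
    return {w: (w in words) for w in word_features}

documents = [(["i", "love", "it"], "positive"),
             (["i", "hate", "it"], "negative")]

sentiment = "positive"   # leftover value from the preprocessing loop

# Buggy: unpacks into `sentment`, so `sentiment` inside the comprehension
# is the stale outer variable -- every example gets the same label.
buggy = [(find_features(tweet_words), sentiment)
         for (tweet_words, sentment) in documents]
assert [label for _, label in buggy] == ["positive", "positive"]

# Fixed: unpack into `sentiment`, so each tweet keeps its own label.
fixed = [(find_features(tweet_words), sentiment)
         for (tweet_words, sentiment) in documents]
assert [label for _, label in fixed] == ["positive", "negative"]
```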

Fix it and you will get:

('Naive Bayes Accuracy:', 66.75)
Most Informative Features
                  -- = True           positi : negati =      6.9 : 1.0
               these = True           positi : negati =      5.6 : 1.0
                face = True           positi : negati =      5.6 : 1.0
                 saw = True           positi : negati =      5.6 : 1.0
                   ] = True           positi : negati =      4.4 : 1.0
               later = True           positi : negati =      4.4 : 1.0
                love = True           positi : negati =      4.1 : 1.0
                  ta = True           positi : negati =      4.0 : 1.0
               quite = True           positi : negati =      4.0 : 1.0
              trying = True           positi : negati =      4.0 : 1.0
               small = True           positi : negati =      4.0 : 1.0
                 thx = True           positi : negati =      4.0 : 1.0
               music = True           positi : negati =      4.0 : 1.0
                   p = True           positi : negati =      4.0 : 1.0
             husband = True           positi : negati =      4.0 : 1.0
acidtobi
  • I changed the typo, but my output isn't changing it's still 100% and not showing the features – Daniel Medina Sada Apr 04 '17 at 20:44
  • Then maybe your train.txt is damaged/incomplete? I read the original data into a DataFrame using `df = pd.read_csv('Sentiment Analysis Dataset.csv', error_bad_lines=False, encoding='utf-8')` and iterated over the rows using `df.iterrows()` to get the output pasted above. – acidtobi Apr 04 '17 at 20:53
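For reference, the pandas-based loading from that comment might look roughly like the sketch below. The column names (`Sentiment`, `SentimentText`) assume the thinknook CSV layout, and the two-row CSV here is made-up stand-in data; note that `error_bad_lines` was removed in pandas 2.x in favor of `on_bad_lines='skip'`.

```python
import io
import pandas as pd

# Made-up stand-in for 'Sentiment Analysis Dataset.csv' (same column layout).
csv_text = io.StringIO(
    "ItemID,Sentiment,SentimentSource,SentimentText\n"
    "1,0,Sentiment140,did not enjoy this at all\n"
    "2,1,Sentiment140,really loved the show\n"
)
df = pd.read_csv(csv_text, encoding='utf-8')

documents = []
for _, row in df.iterrows():
    # Sentiment is an int column here, so comparing to 0 works.
    sentiment = "negative" if row['Sentiment'] == 0 else "positive"
    documents.append((row['SentimentText'].split(), sentiment))
```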