
This is the first time I am building a sentiment analysis machine learning model using the nltk NaiveBayesClassifier in Python. I know it is too simple a model, but it is just a first step for me, and I will try tokenized sentences next time.

The real issue I have with my current model is: I have clearly labeled the word 'bad' as negative in the training data set (as you can see from the 'negative_vocab' variable). However, when I ran the NaiveBayesClassifier on each sentence (lower case) in the list ['awesome movie', ' i like it', ' it is so bad'], the classifier mistakenly labeled 'it is so bad' as positive.

INPUT:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set) 

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad"
sentence = sentence.lower()
words = sentence.split('.')

def word_feat(word):
    return dict([(word,True)])
# Note: the function 'word_feat(word)' defined here is different from the 'word_feats(words)' function defined earlier. This one is used to iterate over each of the three elements in the list ['awesome movie', ' i like it', ' it is so bad'].

for word in words:
    classResult = classifier.classify(word_feat(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
    print(str(word) + ' is ' + str(classResult))
    print() 

OUTPUT:

awesome movie is pos

i like it is pos

it is so bad is pos

To make sure the function 'word_feat(word)' iterates over each sentence instead of each word or letter, I ran some diagnostic code to see what each element passed to 'word_feat(word)' is:

for word in words:
    print(word_feat(word))

And it printed out:

{'awesome movie': True}
{' i like it': True}
{' it is so bad': True} 

So it seems like the function 'word_feat(word)' is correct?

Does anyone know why the classifier classified 'It is so bad' as positive? As mentioned before, I had clearly labeled the word 'bad' as negative in my training data.

Stanleyrr
  • Can you try a neutral word and see if the output is coming in as neutral or positive? – 23nigam Jan 19 '18 at 06:43
  • E.g. `breaking bad is really a good drama`, where `bad -> neutral`? – alvas Jan 19 '18 at 06:58
  • It's a statistical model; there can be many things that cause an output you may not desire, but it might not be wrong. E.g. preprocessing, data bias, backoff strategy, etc. – alvas Jan 19 '18 at 06:59
  • You cannot expect machine learning models to correctly classify EVERY instance. You need to produce some metrics (such as accuracy, confusion matrices etc.) in order to evaluate its performance. After computing such metrics you can then analyse incorrectly classified points and see whether you can improve the performance by (e.g.) introducing more features. – Giorgos Myrianthous Jan 19 '18 at 07:20
  • @23nigam, I tried running it on individual words (i.e. 'movie', 'bad') and the algorithm classified them correctly. But when I put the words into sentences (i.e. "Awesome." "I like it." "It is so bad"), it would classify the sentence "It is so bad" as positive. The only thing I could think of that would cause that is that the algorithm is sentence-dependent (meaning the sentiment of one sentence is influenced by the sentiment of the previous sentence), but I doubt that's the case. – Stanleyrr Jan 20 '18 at 01:28
  • Is there a copy-and-paste mistake in your listing? `word_feats`, `positive_vocab`, `negative_vocab`, `neutral_vocab` are all defined twice. – Darren Cook Jan 20 '18 at 09:40
  • Good catch, @Darren! More important, `train_set` and the classifier itself are defined twice, and with different inputs. Clean up your code, Stanleyrr! – alexis Jan 20 '18 at 11:38
  • Thanks @DarrenCook. Sorry about my error. I mistakenly defined those variables twice. I had fixed that (the revised code should reflect that), but the output is still the same. I still couldn't figure out the cause of the misclassification. – Stanleyrr Jan 20 '18 at 19:35

4 Answers


This particular failure is because your word_feats() function expects a list of words (a tokenized sentence), but you pass it each word separately... so word_feats() iterates over its letters. You've built a classifier that classifies strings as positive or negative on the basis of the letters they contain.
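
To see this concretely, here is a small sketch using the word_feats() from the question:

def word_feats(words):
    return dict([(word, True) for word in words])

# a bare string is iterable, so the comprehension walks over its characters
print(word_feats('bad'))    # {'b': True, 'a': True, 'd': True}
print(word_feats(['bad']))  # {'bad': True}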

You're probably in this predicament because you pay no attention to what you name your variables. In your main loop, none of the variables sentence, words, or word contain what their name claims. To understand and improve your program, start by naming things properly.

Bugs aside, this is not how you build a sentiment classifier. The training data should be a list of tokenized sentences (each labeled with its sentiment), not a list of individual words. Similarly, you classify tokenized sentences.
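
For illustration, here is a minimal sketch of what labeled, tokenized training data could look like; the sentences are made up, and a real model would need far more of them:

from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    # one boolean feature per token of an already-tokenized sentence
    return dict((word, True) for word in words)

train_set = [
    (word_feats(['awesome', 'movie']), 'pos'),
    (word_feats(['i', 'hate', 'this', 'terrible', 'movie']), 'neg'),
    (word_feats(['the', 'sound', 'was', 'ok']), 'neu'),
]
classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(word_feats('it is so bad'.split())))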

alexis
  • I thought my word_feats() function iterates over words, not letters. For example, when I ran the code 'word_feats(positive_vocab)', it returned '{'nice': True, 'outstanding': True, 'great': True, 'terrific': True, ':)': True, 'good': True, 'awesome': True, 'fantastic': True}'. So it was iterating over words, right? I agree that I should build the training data on tokenized sentences, but like I mentioned I am still a rookie in this area. I will implement the tokenized sentences once I become more familiar with NLP. – Stanleyrr Jan 20 '18 at 01:16
  • The example in your comment iterates over words because you passed it a list of words. The code in your question passes `word_feats()` a single string, because you iterate over the list _before_ you call it. Make `word_feats()` print out its argument and the dictionary it builds, and you'll see for yourself. – alexis Jan 20 '18 at 11:34
  • @Darren's comment under your question is spot-on: You actually define two classifiers (the second overwriting the first), one with word list inputs and one with string inputs. But your main loop classifies strings. Clean up your code, name variables appropriately, and pay attention to your data structures! The more so when asking a question. – alexis Jan 20 '18 at 11:41
  • I had fixed my code and updated them in my question section. The output still misclassifies the sentence 'it is so bad' as positive. When I print out 'word_feats(words)', with 'words' referring to the list ['awesome movie', ' i like it', ' it is so bad'], it correctly printed out '{'awesome movie': True, ' i like it': True, ' it is so bad': True}. So that means it must've iterated over each sentence in the list and not the strings, right? – Stanleyrr Jan 20 '18 at 19:52
  • If you want to know what your code iterated over, print out some diagnostic output. Random strangers on the internet, no matter how experienced, are not nearly as reliable. – alexis Jan 20 '18 at 20:24
  • I defined another function 'word_feat(word)' and added it to my code. I tried to make sure my code iterates over the right sentences instead of words or string, so I did 'for word in words: print(word_feat(word))'. It printed out '{'awesome movie': True}', '{' i like it': True}', and '{' it is so bad': True}'. Is this what you meant by print out diagnostic output? Sorry I ask so many questions. I know some of these question might seem obvious to an NLP expert like you, but as a passionate newbie to this field, I am simply trying to learn to code better :) – Stanleyrr Jan 20 '18 at 20:41
  • Yes, that's what I meant. You can tell that this is not how it needs to work, can't you? Your "features" are whole sentences! If you don't see why that's a problem, I suggest you read the chapter in the nltk book, it's a detailed tutorial. And stop thinking you can "implement tokenized sentences" later... – alexis Jan 20 '18 at 22:14

Let me show a rewriting of your code. All I changed near the top was adding import re, as it is easier to tokenize with regexes. Everything else up to defining classifier is the same as your code.

I added one more test case (something really, really negative), but more importantly I used proper variable names, which makes it much harder to get confused about what is going on:

test_data = "Awesome movie. I like it. It is so bad. I hate this terrible useless movie."
sentences = test_data.lower().split('.')

So sentences now contains 4 strings, each a single sentence. I left your word_feat() function unchanged.

For using the classifier I did quite a big rewrite:

for sentence in sentences:
    if len(sentence) == 0: continue
    neg = 0
    pos = 0
    for word in re.findall(r"[\w']+", sentence):
        classResult = classifier.classify(word_feat(word))
        print(word, classResult)
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
    print("\n%s: %d vs -%d\n"%(sentence,pos,neg))

The outer loop again uses a descriptive name, so that sentence contains one sentence.

I then have an inner loop where we classify each word in the sentence; I am using a regex to split the sentence up on whitespace and punctuation marks:

 for word in re.findall(r"[\w']+", sentence):
     classResult = classifier.classify(word_feat(word))

The rest is just basic adding up and reporting. I get this output:

awesome pos
movie neu

awesome movie: 1 vs -0

i pos
like pos
it pos

 i like it: 3 vs -0

it pos
is neu
so pos
bad neg

 it is so bad: 2 vs -1

i pos
hate neg
this pos
terrible neg
useless neg
movie neu

 i hate this terrible useless movie: 2 vs -3

I still get the same result as you: "it is so bad" is considered positive. And with the extra debug lines we can see it is because "it" and "so" are considered positive words, and "bad" is the only negative word, so overall it is positive.

I suspect this is because it hadn't seen those words in its training data.

...yes, if I add "it" and "so" to the list of neutral words, I get "it is so bad: 0 vs -1".
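
If you want to dig deeper, NLTK's prob_classify() returns the full probability distribution over the labels instead of just the winner. A minimal sketch, reusing the classifier and word_feat() from above:

# inspect how confident the classifier is about each word's label
for w in ['it', 'so', 'bad']:
    dist = classifier.prob_classify(word_feat(w))
    print(w, {label: round(dist.prob(label), 3) for label in dist.samples()})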

As next things to try, I'd suggest:

  • Try with more training data; toy examples like this carry the risk that the noise will swamp the signal.
  • Look into removing stop words (a rough sketch follows below).
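
Here is a rough sketch of the stop-word idea; it assumes the NLTK stopwords and punkt data have been downloaded (e.g. via nltk.download()):

import nltk
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
tokens = nltk.word_tokenize("it is so bad")
# 'it', 'is' and 'so' are all in NLTK's English stop list, leaving just 'bad'
print([t for t in tokens if t not in stops])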
Darren Cook
  • `s/re.findall(r"[\w']+",/nltk.word_tokenize(/`. As a matter of principle and future uses... – alexis Jan 20 '18 at 22:15
  • @Darren, thank you! This is super helpful information. It's a good idea to print out the classification of each word in the sentences like you did - I should do that more often. So I added the words 'it', 'so' and 'really' to my 'neutral_vocab' variable, and then tried the classification again. Strangely, the word 'it', 'so' and 'really' by itself is classified as neutral. But when I classify the sentence 'really bad', it still returned positive. At this point I am going to try some other sentiment analysis functions in Python, adding more training data to the model and removing stop words. – Stanleyrr Jan 20 '18 at 22:41

Here is the modified code for you

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
from nltk.corpus import stopwords

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set) 

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad."
sentence = sentence.lower()
sentences = sentence.split('.')   # this is actually a list of sentences

for sent in sentences:
    if sent != "":
        words = [word for word in sent.split(" ") if word not in stopwords.words('english')]
        classResult = classifier.classify(word_feats(words))
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
        print(str(sent) + ' --> ' + str(classResult))
        print()

I modified the part where you were passing a 'list of words' as input to your classifier. Actually, you need to pass the sentences one by one, which means you need to pass a 'list of sentences'.

Also, for each sentence, you need to pass 'words as features', which means you need to split each sentence on whitespace.

Also, if you want your classifier to work properly for sentiment analysis, you need to give less weight to 'stop words' like 'it', 'they', 'is', etc., as these words are not sufficient to decide whether the sentence is positive, negative or neutral.

The above code gives the output below:

awesome movie --> pos

 i like it --> pos

 it is so bad --> neg

So for any classifier, the input format for training and for prediction should be the same. While training you provided a list of words, so use the same method to convert your test set as well.
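
To make that symmetry concrete, here is a small sketch reusing the word_feats() from the code above:

def word_feats(words):
    return dict([(word, True) for word in words])

# training time: a token list becomes a feature dict, paired with a label
train_example = (word_feats(['awesome', 'movie']), 'pos')

# prediction time: the test sentence must go through the identical conversion
test_features = word_feats('it is so bad'.split())
# classifier.classify(test_features) then sees the same feature shape as in training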

Gunjan
  • Thank you @Gunjan. This helps me a lot. If I am correct, I think one of the problems with my original script (in addition to other errors) was that I had passed individual sentences to 'word_feats' instead of individual words, and that confused the ML model and made it ineffective in classifying the correct sentiments. – Stanleyrr Jan 22 '18 at 23:52
  • @Stanleyrr : yes, so basically when you say you are passing words, you are actually converting your sentence into a list of features (in our case the features are words). An ML model will work entirely on the features you provide. Also, removing stop words makes your features (words) more refined. This also affects your output, as the model now ignores words like "it" and "so". – Gunjan Jan 23 '18 at 05:21

You can try this code

from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])

positive_vocab = [ 'awesome', 'outstanding', 'fantastic','terrific','good','nice','great', ':)','love' ]
negative_vocab = [ 'bad', 'terrible','useless','hate',':(','kill','steal']
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not' ]

positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]

train_set = negative_features + positive_features + neutral_features

classifier = NaiveBayesClassifier.train(train_set) 

# Predict
neg = 0
pos = 0

sentence = " Awesome movie, I like it :)"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1


print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))

The results are:

Positive: 0.7142857142857143
Negative: 0.14285714285714285

Krebto