
I am building a sentiment analysis classifier with scikit-learn. It has 3 labels: positive, neutral and negative. The shape of my training data is (14640, 15), with the following class distribution:

negative    9178
neutral     3099
positive    2363

I have pre-processed the data and applied bag-of-words vectorization to the tweet text (the dataset has many other attributes too), which gives a feature matrix of shape (14640, 1000). Since the label Y is in text form, I applied LabelEncoder to it. This is how I split my dataset:

X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, random_state=42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

out: (10248, 1000) (10248,)
     (4392, 1000) (4392,)

And this is my classifier

svc = svm.SVC(kernel='linear', C=1, probability=True).fit(X_train, Y_train) 
prediction = svc.predict_proba(X_test) 
prediction_int = prediction[:,1] >= 0.3 
prediction_int = prediction_int.astype(np.int) 
print('Precision score: ', precision_score(Y_test, prediction_int, average=None))
print('Accuracy Score: ', accuracy_score(Y_test, prediction_int))

out:Precision score:  [0.73980398 0.48169243 0.        ]
Accuracy Score:  0.6675774134790529
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

Now I am not sure why the third value in the precision score is 0. I passed average=None to get a separate precision score for every class. I am also not sure whether the prediction step is right, because I wrote it for binary classification. Can you please help me debug it and make it better? Thanks in advance.

desertnaut

1 Answer


As the warning explains:

UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.

it seems that one of your 3 classes is missing from your predictions prediction_int (i.e. you never predict it); you can easily check if this is the case with

set(Y_test) - set(prediction_int)

which should return the empty set, set(), if this is not the case.
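As a minimal sketch of this check, the snippet below runs it on a small made-up probability matrix (the values and the names `proba`, `y_true` are hypothetical, not taken from the question's data). It also contrasts the single-column threshold used in the question, which after the cast to integers can only ever produce the labels 0 and 1, with taking the most probable class per row. Whether this fully explains the missing class on the real data is not confirmed in this thread.

```python
import numpy as np

# Hypothetical class probabilities for 5 samples and 3 classes (rows sum to 1).
proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.4, 0.1],
    [0.1, 0.1, 0.8],
])
y_true = np.array([0, 1, 2, 0, 2])

# Thresholding one column (as in the question) gives a boolean array,
# so after the cast only the labels 0 and 1 can appear.
thresholded = (proba[:, 1] >= 0.3).astype(int)
missing_thresholded = set(y_true) - set(thresholded)

# Picking the most probable class per row can return any of the 3 labels.
argmax_pred = proba.argmax(axis=1)
missing_argmax = set(y_true) - set(argmax_pred)

print("missing with threshold:", missing_thresholded)
print("missing with argmax:   ", missing_argmax)
```

Calling `svc.predict(X_test)` directly is another way to get one label per sample without handling the probabilities yourself.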

If this is indeed the case, and the above operation gives {1} or {2}, the most probable reason is that your dataset is imbalanced (you have many more negative samples) and you do not ask for a stratified split; modify your train_test_split to

X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, stratify=Y, random_state=42)

and try again.
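To illustrate what `stratify` does, here is a small sketch that rebuilds only the label vector from the class counts given in the question and checks that the test split keeps roughly the same class proportions. The placeholder feature matrix `X` is an assumption made for the sake of the example; the 0/1/2 encoding follows from LabelEncoder sorting the label names alphabetically.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts reported in the question
# (0 = negative, 1 = neutral, 2 = positive).
y = np.repeat([0, 1, 2], [9178, 3099, 2363])
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Class proportions in the full data and in the stratified test split.
full_ratio = np.bincount(y) / len(y)
test_ratio = np.bincount(y_test) / len(y_test)

print("full:", full_ratio.round(3))
print("test:", test_ratio.round(3))
```

Stratification only guarantees that every class is represented proportionally in both splits; it does not by itself make the classifier predict the minority classes.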

UPDATE (after comments):

As it turns out, you have a class imbalance problem (and not a coding issue) which prevents your classifier from successfully predicting your 3rd class (positive). Class imbalance is a huge sub-topic in itself, and there are several remedies proposed. Although going into more detail is arguably beyond the scope of a single SO thread, the first thing you should try (on top of the suggestions above) is to use the class_weight='balanced' argument in the definition of your classifier, i.e.:

svc = svm.SVC(kernel='linear', C=1, probability=True, class_weight='balanced').fit(X_train, Y_train) 

For more options, have a look at the dedicated imbalanced-learn Python library (part of the scikit-learn-contrib projects).
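For intuition about what `class_weight='balanced'` does, scikit-learn's documentation gives the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula by hand to the class counts from the question; the variable names are illustrative only.

```python
import numpy as np

# Class counts from the question, in LabelEncoder order.
class_names = ["negative", "neutral", "positive"]
counts = np.array([9178, 3099, 2363])

n_samples = counts.sum()
n_classes = len(counts)

# 'balanced' heuristic: rarer classes receive proportionally larger weights.
weights = n_samples / (n_classes * counts)

for name, weight in zip(class_names, weights):
    print(f"{name}: {weight:.3f}")
```

The rarest class (positive) ends up with the largest weight, so misclassifying it costs more during training.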

desertnaut
  • did that, but `set(Y_test) - set(prediction_int)` gives me back `{2}`, so you were right. Next I changed `X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, random_state=42)` to `X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, stratify=Y, random_state=42)` and ran it, but the result is the same as before: the last precision is still missing and I still get `{2}`. Do you think anything is wrong with my prediction? – Deb Prakash Chatterjee Aug 07 '19 at 16:02
  • @DebPrakashChatterjee that's because in fact you don't have any *coding* issue (which arguably SO is all about), but a *data* issue (class imbalance). Notice that I just advised "try again" (and not "it should work", or something) - i.e. what I wrote is the correct action to do but still doesn't mean that it is *sufficient* for your imbalanced data... – desertnaut Aug 07 '19 at 16:28
  • That's pretty clear, but is there any option to do? I understand that this might fail, but I still wanna do this, I am near about completing my final project. So, I am requesting, is there any chance to succeed? – Deb Prakash Chatterjee Aug 07 '19 at 17:03
  • I got it, I have printed the `prediction_int`, and it turns out that, it is not printing the third class `[0 0 0 ... 1 0 0]`. Pretty problematic. – Deb Prakash Chatterjee Aug 07 '19 at 17:09
  • @DebPrakashChatterjee you mean, after also using `class_weight='balanced'`? – desertnaut Aug 07 '19 at 17:11
  • yes, after using `class_weight='balanced'`. I have updated my code too, please check. – Deb Prakash Chatterjee Aug 07 '19 at 17:14
  • @DebPrakashChatterjee pls do not alter the code after an answer has been provided - it makes the response look irrelevant (restored the previous edit); you have now arguably narrowed down the cause of your issue, and you may want to now open a **new** more focused question – desertnaut Aug 07 '19 at 17:18
  • this is the new thread link - https://stackoverflow.com/questions/57401272/how-to-predict-all-classes-in-a-multi-class-sentiment-analysis-problem-using-svm/57401558#57401558 – Deb Prakash Chatterjee Aug 07 '19 at 20:43