I'm trying to create a full pipeline of results for sentiment analysis on a smaller subset of the IMDB reviews (only 2k positive, 2k negative), so I want to show results at each stage,

i.e. without any pre-processing; then with basic cleaning (removing special characters and stopwords, lowercasing); then with stemming and lemmatization tested separately on top of the basic cleaning.
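The cleaning steps look roughly like this (a simplified sketch using NLTK; the exact regex and stopword handling here are illustrative, not my verbatim code):

import re
from nltk.corpus import stopwords                        # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer, WordNetLemmatizer   # lemmatizer needs nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def basic_clean(text):
    text = text.lower()                            # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)          # remove special characters
    return [t for t in text.split() if t not in STOPWORDS]  # drop stopwords

def stem(tokens):
    return [stemmer.stem(t) for t in tokens]       # Porter stemming

def lemmatize(tokens):
    return [lemmatizer.lemmatize(t) for t in tokens]  # WordNet lemmatization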

After basic cleaning, accuracy jumps from 50% (which makes sense as a binary-classification baseline) to the low-to-mid 80s. Then, after adding stemming or lemmatization on top, the scores either don't change or, for random forest, recall drops below 80%.

Why is this the case? Are these results normal? If so, how do you justify using either one?

Also, note that all of the models and feature extractors use sklearn's default parameters, as I haven't gotten to the model-optimization part yet. Should I try tuning for each of these three preprocessing cases and then see whether any of them perform worse?

Feature extraction: bag of words (BOW) and TF-IDF

Models: SVM, logistic regression, multinomial naive Bayes, and random forest
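The evaluation loop looks roughly like the sketch below (simplified; LinearSVC stands in for the SVM here, and the 75/25 split is an assumption that matches the 1000-review test support in the reports):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def run_grid(texts, labels):
    # texts: cleaned review strings; labels: "Positive"/"Negative"
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=42
    )
    vectorizers = {"BOW": CountVectorizer(), "TF-IDF": TfidfVectorizer()}
    models = {
        "SVM": LinearSVC(),
        "LR": LogisticRegression(),
        "MNB": MultinomialNB(),
        "RFC": RandomForestClassifier(),
    }
    for vec_name, vec in vectorizers.items():
        Xtr = vec.fit_transform(X_train)   # fit vocabulary on training data only
        Xte = vec.transform(X_test)
        for model_name, model in models.items():
            model.fit(Xtr, y_train)        # all defaults, no tuning yet
            print(model_name, vec_name)
            print(classification_report(y_test, model.predict(Xte)))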

Results:

Basic Cleaning (remove special characters, stopwords, lowercasing)

SVM BOW
              precision    recall  f1-score   support

    Positive       0.85      0.85      0.85       530
    Negative       0.83      0.83      0.83       470

    accuracy                           0.84      1000
   macro avg       0.84      0.84      0.84      1000
weighted avg       0.84      0.84      0.84      1000


SVM TF-IDF
              precision    recall  f1-score   support

    Positive       0.85      0.88      0.86       530
    Negative       0.86      0.83      0.84       470

    accuracy                           0.85      1000
   macro avg       0.86      0.85      0.85      1000
weighted avg       0.86      0.85      0.85      1000


LR BOW
              precision    recall  f1-score   support

    Positive       0.87      0.85      0.86       530
    Negative       0.83      0.85      0.84       470

    accuracy                           0.85      1000
   macro avg       0.85      0.85      0.85      1000
weighted avg       0.85      0.85      0.85      1000


LR TF-IDF
              precision    recall  f1-score   support

    Positive       0.89      0.82      0.85       530
    Negative       0.81      0.88      0.84       470

    accuracy                           0.85      1000
   macro avg       0.85      0.85      0.85      1000
weighted avg       0.85      0.85      0.85      1000


MNB BOW
              precision    recall  f1-score   support

    Positive       0.83      0.85      0.84       530
    Negative       0.82      0.81      0.82       470

    accuracy                           0.83      1000
   macro avg       0.83      0.83      0.83      1000
weighted avg       0.83      0.83      0.83      1000


MNB TF-IDF
              precision    recall  f1-score   support

    Positive       0.86      0.84      0.85       530
    Negative       0.82      0.85      0.83       470

    accuracy                           0.84      1000
   macro avg       0.84      0.84      0.84      1000
weighted avg       0.84      0.84      0.84      1000


RFC BOW
              precision    recall  f1-score   support

    Positive       0.85      0.80      0.82       530
    Negative       0.79      0.84      0.81       470

    accuracy                           0.82      1000
   macro avg       0.82      0.82      0.82      1000
weighted avg       0.82      0.82      0.82      1000


RFC TF-IDF
              precision    recall  f1-score   support

    Positive       0.84      0.81      0.83       530
    Negative       0.80      0.83      0.81       470

    accuracy                           0.82      1000
   macro avg       0.82      0.82      0.82      1000
weighted avg       0.82      0.82      0.82      1000

Basic Cleaning + Stemming

SVM BOW
              precision    recall  f1-score   support

    Positive       0.85      0.82      0.83       530
    Negative       0.80      0.83      0.82       470

    accuracy                           0.82      1000
   macro avg       0.82      0.82      0.82      1000
weighted avg       0.82      0.82      0.82      1000


SVM TF-IDF
              precision    recall  f1-score   support

    Positive       0.85      0.85      0.85       530
    Negative       0.83      0.83      0.83       470

    accuracy                           0.84      1000
   macro avg       0.84      0.84      0.84      1000
weighted avg       0.84      0.84      0.84      1000


LR BOW
              precision    recall  f1-score   support

    Positive       0.85      0.83      0.84       530
    Negative       0.81      0.84      0.83       470

    accuracy                           0.83      1000
   macro avg       0.83      0.83      0.83      1000
weighted avg       0.83      0.83      0.83      1000


LR TF-IDF
              precision    recall  f1-score   support

    Positive       0.89      0.81      0.85       530
    Negative       0.80      0.88      0.84       470

    accuracy                           0.84      1000
   macro avg       0.84      0.85      0.84      1000
weighted avg       0.85      0.84      0.84      1000


MNB BOW
              precision    recall  f1-score   support

    Positive       0.83      0.84      0.84       530
    Negative       0.82      0.81      0.82       470

    accuracy                           0.83      1000
   macro avg       0.83      0.83      0.83      1000
weighted avg       0.83      0.83      0.83      1000


MNB TF-IDF
              precision    recall  f1-score   support

    Positive       0.87      0.83      0.85       530
    Negative       0.82      0.86      0.84       470

    accuracy                           0.84      1000
   macro avg       0.84      0.84      0.84      1000
weighted avg       0.84      0.84      0.84      1000


RFC BOW
              precision    recall  f1-score   support

    Positive       0.84      0.77      0.80       530
    Negative       0.76      0.83      0.79       470

    accuracy                           0.80      1000
   macro avg       0.80      0.80      0.80      1000
weighted avg       0.80      0.80      0.80      1000


RFC TF-IDF
              precision    recall  f1-score   support

    Positive       0.83      0.79      0.81       530
    Negative       0.78      0.81      0.80       470

    accuracy                           0.80      1000
   macro avg       0.80      0.80      0.80      1000
weighted avg       0.80      0.80      0.80      1000

Basic Cleaning + Lemmatization

SVM BOW
              precision    recall  f1-score   support

    Positive       0.84      0.83      0.83       530
    Negative       0.81      0.82      0.82       470

    accuracy                           0.83      1000
   macro avg       0.83      0.83      0.83      1000
weighted avg       0.83      0.83      0.83      1000


SVM TF-IDF
              precision    recall  f1-score   support

    Positive       0.85      0.86      0.86       530
    Negative       0.84      0.83      0.84       470

    accuracy                           0.85      1000
   macro avg       0.85      0.85      0.85      1000
weighted avg       0.85      0.85      0.85      1000


LR BOW
              precision    recall  f1-score   support

    Positive       0.86      0.84      0.85       530
    Negative       0.82      0.84      0.83       470

    accuracy                           0.84      1000
   macro avg       0.84      0.84      0.84      1000
weighted avg       0.84      0.84      0.84      1000


LR TF-IDF
              precision    recall  f1-score   support

    Positive       0.88      0.81      0.84       530
    Negative       0.80      0.87      0.84       470

    accuracy                           0.84      1000
   macro avg       0.84      0.84      0.84      1000
weighted avg       0.84      0.84      0.84      1000


MNB BOW
              precision    recall  f1-score   support

    Positive       0.82      0.85      0.83       530
    Negative       0.82      0.80      0.81       470

    accuracy                           0.82      1000
   macro avg       0.82      0.82      0.82      1000
weighted avg       0.82      0.82      0.82      1000


MNB TF-IDF
              precision    recall  f1-score   support

    Positive       0.85      0.83      0.84       530
    Negative       0.81      0.84      0.82       470

    accuracy                           0.83      1000
   macro avg       0.83      0.83      0.83      1000
weighted avg       0.83      0.83      0.83      1000


RFC BOW
              precision    recall  f1-score   support

    Positive       0.84      0.78      0.81       530
    Negative       0.77      0.83      0.80       470

    accuracy                           0.80      1000
   macro avg       0.80      0.81      0.80      1000
weighted avg       0.81      0.80      0.80      1000


RFC TF-IDF
              precision    recall  f1-score   support

    Positive       0.84      0.81      0.82       530
    Negative       0.80      0.82      0.81       470

    accuracy                           0.82      1000
   macro avg       0.82      0.82      0.82      1000
weighted avg       0.82      0.82      0.82      1000

AdamG

1 Answer

I would assume the scores you are getting are about as good as they will get using bag-of-words or TF-IDF approaches.

For instance, the sentiment doesn't change between "I hated every minute of this movie, the plot was going nowhere" and its stemmed form, "I hate every minute of this movie, the plot is go nowhere".
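For example, running NLTK's Porter stemmer over that sentence (a quick illustration, not code from the question):

import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
sent = "I hated every minute of this movie, the plot was going nowhere"
tokens = re.findall(r"[a-z]+", sent.lower())
print(" ".join(stemmer.stem(t) for t in tokens))
# e.g. "hated" -> "hate", "going" -> "go": the inflection that distinguishes
# the two versions is stripped, but the sentiment-bearing words are unchanged.

Stemming and lemmatization mostly collapse inflection, and inflection carries little sentiment signal, which is why your scores barely move when you add them.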

Darren Cook
  • So for my next step I was planning on picking one of these pre-processing options (due to time and computation resources) and then doing some parameter tuning for BOW/TF-IDF and some of the models. Should I just drop lemmatization and stemming and stick with basic cleaning, or should I test all of them further and then decide? – AdamG Dec 13 '22 at 16:08
  • @AdamG It depends on what your goal is. None of the approaches you are using will beat a transformer model like BERT on sentiment analysis, and you generally feed raw text into a BERT model (because it uses the whole sentence, so you want all those "stop" words, etc.). – Darren Cook Dec 13 '22 at 16:47
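As a follow-up to that last comment, a minimal sketch of the transformer route (this assumes the Hugging Face transformers library; the checkpoint is the stock SST-2 sentiment model, not something from this thread):

from transformers import pipeline

# Raw, uncleaned text goes straight in -- no stopword removal, no stemming.
clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english")
print(clf("I hated every minute of this movie, the plot was going nowhere"))
# -> [{'label': 'NEGATIVE', 'score': ...}]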