
I am fairly new to machine learning and have been tasked with building a machine learning model to predict whether a review is good (1) or bad (0). I have already tried a RandomForestClassifier, which produced an accuracy of 50%. I switched to the Naive Bayes classifier but am still not getting any improvement, even after conducting a grid search.

My data looks like this (I am happy to share the data with anyone):

                                                 Reviews  Labels
0      For fans of Chris Farley, this is probably his...       1
1      Fantastic, Madonna at her finest, the film is ...       1
2      From a perspective that it is possible to make...       1
3      What is often neglected about Harold Lloyd is ...       1
4      You'll either love or hate movies such as this...       1
                                              ...     ...
14995  This is perhaps the worst movie I have ever se...       0
14996  I was so looking forward to seeing this film t...       0
14997  It pains me to see an awesome movie turn into ...       0
14998  "Grande Ecole" is not an artful exploration of...       0
14999  I felt like I was watching an example of how n...       0
[15000 rows x 2 columns]

My code to preprocess the text and apply TfidfVectorizer before training the classifier is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# stopwords is my list of stop words, e.g. nltk.corpus.stopwords.words('english')
vect = TfidfVectorizer(stop_words=stopwords, max_features=5000)
X_train = vect.fit_transform(all_train_set['Reviews'])
y_train = all_train_set['Labels']

clf = MultinomialNB()
clf.fit(X_train, y_train)

X_test = vect.transform(all_test_set['Reviews'])
y_test = all_test_set['Labels']

print(classification_report(y_test, clf.predict(X_test), digits=4))

The results of the classification report seem to indicate that whilst one label is predicted very well, the other is predicted extremely poorly, bringing the whole thing down.

              precision    recall  f1-score   support
           0     0.5000    0.8546    0.6309      2482
           1     0.5000    0.1454    0.2253      2482
    accuracy                         0.5000      4964
   macro avg     0.5000    0.5000    0.4281      4964
weighted avg     0.5000    0.5000    0.4281      4964

I have now followed 8 different tutorials on this and tried each different way of coding it, but I can't seem to get it above 50%, which makes me think it may be a problem with my features.
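To check whether the features carry any signal at all, I could compare against a majority-class baseline and look at the most indicative terms per class. This is just a sketch reusing the `vect` and `clf` objects from above (`get_feature_names_out` is the method name in recent scikit-learn; older versions call it `get_feature_names`):

import numpy as np
from sklearn.dummy import DummyClassifier

# Majority-class baseline: if MultinomialNB only matches this,
# the features are carrying no signal.
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print('baseline accuracy:', dummy.score(X_test, y_test))

# Most indicative terms per class (feature_log_prob_ holds
# log P(term | class) for MultinomialNB).
feature_names = np.array(vect.get_feature_names_out())
for label in (0, 1):
    top = np.argsort(clf.feature_log_prob_[label])[-10:]
    print(label, feature_names[top])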

If anyone has any idea or suggestions, I'd greatly appreciate it.

EDIT: Okay, so I have added several preprocessing steps, including removing HTML tags, removing punctuation and single letters, and collapsing multiple spaces, using the code below:

import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def preprocess_text(sen):
    # Remove html tags
    sentence = remove_tags(sen)
    # Remove punctuation and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    # Remove single characters surrounded by whitespace
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Collapse multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence
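For reference, this is roughly how I apply it before vectorizing (a sketch; it assumes the same `all_train_set`/`all_test_set` DataFrames and `vect` object as above):

# Clean both splits, then fit the vectorizer on the cleaned training text only.
all_train_set['Reviews'] = all_train_set['Reviews'].apply(preprocess_text)
all_test_set['Reviews'] = all_test_set['Reviews'].apply(preprocess_text)

X_train = vect.fit_transform(all_train_set['Reviews'])
X_test = vect.transform(all_test_set['Reviews'])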

I believe TfidfVectorizer automatically puts everything in lower case and lemmatizes it. The end result is still only 0.5.

geds133

1 Answer


Text preprocessing is very important here. Removing stop words alone is not enough; I think you should also consider the following:

  • converting the text to lowercase
  • removing punctuation
  • apostrophe lookup ("'ll" -> " will", "'ve" -> " have")
  • removing numbers
  • lemmatization and/or stemming of the reviews
  • etc.

Have a look at common text preprocessing methods.
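As a rough sketch of what such a pipeline could look like (using NLTK for lemmatization; the apostrophe map below is just an illustrative subset, not a complete lookup):

import re
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet', quiet=True)
lemmatizer = WordNetLemmatizer()

# Illustrative subset of apostrophe expansions; extend as needed.
APOSTROPHES = {"'ll": " will", "'ve": " have", "n't": " not", "'re": " are"}

def clean_review(text):
    text = text.lower()  # lowercase
    for contraction, expansion in APOSTROPHES.items():
        text = text.replace(contraction, expansion)  # apostrophe lookup
    text = re.sub(r'[^a-z\s]', ' ', text)  # remove punctuation and numbers
    # Lemmatize each token (WordNetLemmatizer defaults to noun POS)
    return ' '.join(lemmatizer.lemmatize(t) for t in text.split())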

Sami Belkacem
  • Please see above. I believe `TfidfVectorizer` automatically puts everything in lower case and lemmatizes it – geds133 Jan 10 '20 at 11:34