I am fairly new to machine learning and have been tasked with building a model to predict whether a review is good (1) or bad (0). I have already tried a RandomForestClassifier, which gave an accuracy of 50%. I switched to a Naive Bayes classifier but am still not seeing any improvement, even after conducting a grid search.
My data looks like this (I am happy to share the data with anyone):
Reviews Labels
0 For fans of Chris Farley, this is probably his... 1
1 Fantastic, Madonna at her finest, the film is ... 1
2 From a perspective that it is possible to make... 1
3 What is often neglected about Harold Lloyd is ... 1
4 You'll either love or hate movies such as this... 1
... ...
14995 This is perhaps the worst movie I have ever se... 0
14996 I was so looking forward to seeing this film t... 0
14997 It pains me to see an awesome movie turn into ... 0
14998 "Grande Ecole" is not an artful exploration of... 0
14999 I felt like I was watching an example of how n... 0
[15000 rows x 2 columns]
My code to preprocess the text with TfidfVectorizer and train the classifier is as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# `stopwords` is my list of stop words
vect = TfidfVectorizer(stop_words=stopwords, max_features=5000)
X_train = vect.fit_transform(all_train_set['Reviews'])
y_train = all_train_set['Labels']

clf = MultinomialNB()
clf.fit(X_train, y_train)

X_test = vect.transform(all_test_set['Reviews'])
y_test = all_test_set['Labels']

print(classification_report(y_test, clf.predict(X_test), digits=4))
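For completeness, this is roughly how I create all_train_set and all_test_set (a sketch with a toy DataFrame standing in for my real 15,000-row one; the column names match my data, the rest is illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the full 15,000-row DataFrame shown above.
df = pd.DataFrame({
    'Reviews': ['great film'] * 6 + ['awful film'] * 6,
    'Labels':  [1] * 6 + [0] * 6,
})

# Stratify on the labels so both splits keep the same 1/0 balance.
all_train_set, all_test_set = train_test_split(
    df, test_size=0.5, stratify=df['Labels'], random_state=42
)
```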
The classification report seems to indicate that while one label is predicted very well, the other is extremely poor, dragging the overall score down.
precision recall f1-score support
0 0.5000 0.8546 0.6309 2482
1 0.5000 0.1454 0.2253 2482
accuracy 0.5000 4964
macro avg 0.5000 0.5000 0.4281 4964
weighted avg 0.5000 0.5000 0.4281 4964
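To check my suspicion that 50% on a balanced dataset is no better than chance, I compared against a no-skill baseline (a sketch using scikit-learn's DummyClassifier on toy balanced labels; the features are ignored by the dummy):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy balanced labels standing in for y_train / y_test.
y = np.array([0, 1] * 100)
X = np.zeros((200, 1))  # dummy ignores the features entirely

# Always predicts the majority class.
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X, y)
acc = dummy.score(X, y)  # accuracy = fraction of the majority class
```

With perfectly balanced labels the baseline accuracy is 0.5, which is exactly what both my classifiers achieve.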
I have now followed 8 different tutorials on this and tried each different way of coding it, but I can't get it above 50%, which makes me think the problem may be with my features.
If anyone has any idea or suggestions, I'd greatly appreciate it.
EDIT: I have added several preprocessing steps, including removing HTML tags, removing punctuation, removing single letters, and collapsing multiple spaces, using the code below:
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def preprocess_text(sen):
    # Remove HTML tags
    sentence = remove_tags(sen)
    # Remove punctuation and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    # Remove single characters
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Collapse multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence
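I then apply the cleaner to the review column before vectorizing, roughly like this (self-contained sketch with a toy row; my real column name is 'Reviews'):

```python
import re
import pandas as pd

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def preprocess_text(sen):
    sentence = remove_tags(sen)                          # strip HTML tags
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)        # drop punctuation/digits
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)  # drop single letters
    sentence = re.sub(r'\s+', ' ', sentence)             # collapse whitespace
    return sentence

df = pd.DataFrame({'Reviews': ['<b>Great</b> movie!!  A 10/10.']})
df['Reviews'] = df['Reviews'].apply(preprocess_text)
```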
I believe TfidfVectorizer automatically lower-cases everything (though, as far as I can tell, it does not lemmatize). The end result is still only 0.5.