
I am trying to do classification with machine learning. I have "good" and "bad" classes in my dataset.

Dataset shape: (248857, 12)

Due to some conditions, I am not able to collect more "good" class samples; there are around 40k "good" and 210k "bad" results. Is this class imbalance an issue for the models?

I trained the model in the following way (Naive Bayes is shown as an example, but I also use KNN, SVM, MLP, Random Forest, and Decision Tree):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Split features and label, then hold out 33% of the data for testing
X = df.drop(['Label'], axis=1)
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fit Naive Bayes and evaluate on the held-out set
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_predNaive = classifier.predict(X_test)
print(f'Test score {accuracy_score(y_test, y_predNaive)}')
plot_confusionmatrix(y_predNaive, y_test, dom='Test')
print('Classification Report for Naive Bayes\n\n', classification_report(y_test, y_predNaive))
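For reference, this is a minimal sketch of how the same split/fit/evaluate pattern could be run over the other classifiers I mentioned; the hyperparameters shown are illustrative defaults, not necessarily the ones I actually use:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Illustrative default settings for each model
models = {
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(),
    'MLP': MLPClassifier(max_iter=500),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

# Fit each model on the same split and print its per-class report
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f'Classification Report for {name}\n', classification_report(y_test, y_pred))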

1 Answer


There are multiple ways to deal with this issue. You can change the scoring metric to something like the F-score, which is more informative than plain accuracy on imbalanced data. Alternatively, you can randomly remove about 170k "bad" samples (undersampling) so the two classes are equal in size. Furthermore, random forests are fairly robust to imbalanced datasets, so you might be able to skip that preprocessing entirely by sticking with a random forest. A rough sketch of these options is shown below.
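As a rough sketch of those options (assuming scikit-learn and pandas; the column name 'Label' and the lowercase class values 'good'/'bad' are taken from the question and may differ in your data):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, classification_report

# Option 1: judge the model with F-score instead of accuracy;
# class_weight='balanced' also re-weights the minority class during training
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Per-class report:\n', classification_report(y_test, y_pred))
print('Macro F1:', f1_score(y_test, y_pred, average='macro'))

# Option 2: random undersampling -- keep all ~40k "good" rows
# and sample an equal number of "bad" rows, then shuffle
good = df[df['Label'] == 'good']
bad = df[df['Label'] == 'bad'].sample(n=len(good), random_state=42)
df_balanced = pd.concat([good, bad]).sample(frac=1, random_state=42)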

  • I used Random Forest, SVM, KNN, MultiLayerP, Decision Tree, Naive Bayes. I wanted to see all the results. I just wanted to be sure that this ratio in my dataset would not create a problem. I used the confusion matrix and classification report. – linuxgakgos Jun 26 '22 at 19:21
  • The best way to check this would be running it with an imbalanced set and checking the confusion matrix, then comparing that to the confusion matrix with the balanced set. – joshniemela Jun 26 '22 at 22:38
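To make the check described in the last comment concrete, here is a minimal sketch (the helper confusion_for is hypothetical, and df_balanced is the undersampled frame from the earlier sketch):

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

def confusion_for(frame):
    # Train on the given dataset (imbalanced or balanced) and return the test confusion matrix
    X = frame.drop(['Label'], axis=1)
    y = frame['Label']
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
    model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    return confusion_matrix(y_te, model.predict(X_te))

print('Imbalanced:\n', confusion_for(df))
print('Balanced:\n', confusion_for(df_balanced))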