
I have a model whose training accuracy is 95-100%, so I believe it is overfitting and I want to avoid that. One way to avoid overfitting is k-fold cross-validation, but cross-validation produces a separate result for each fold. How do I choose the best result from those different results and then predict on unseen data?


from sklearn.model_selection import train_test_split

# Hold out 25% of the data as a test set.
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.25, random_state=42)


from sklearn.ensemble import RandomForestClassifier

# Fit a random forest on the training split and predict on the test split.
rf = RandomForestClassifier(random_state=42)
rf.fit(train_features, train_labels)

predictions = rf.predict(test_features)
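
Before changing anything, it helps to see how large the train/test gap actually is. A minimal sketch, reusing the split and the `rf` model above, of comparing training accuracy with test accuracy; a large gap between the two is the usual sign of overfitting:

from sklearn.metrics import accuracy_score

# A large gap (e.g. near-perfect train accuracy vs. much lower test accuracy)
# suggests the model is overfitting.
train_accuracy = accuracy_score(train_labels, rf.predict(train_features))
test_accuracy = accuracy_score(test_labels, predictions)
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)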

The cross-validation example from the sklearn documentation is:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
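
Note that `cross_val_score` returns one score per fold, and those per-fold results are estimates of generalization performance rather than alternative models to choose between. A minimal sketch of one common workflow, assuming `X` and `y` are the training features and labels used above and `X_new` is a hypothetical batch of unseen samples: summarize the fold scores, then refit a single model on all of the training data and use that model for prediction.

# `scores` holds one accuracy value per fold; report their mean and spread
# rather than picking the best fold.
print(scores)
print("CV accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std()))

# The per-fold models are discarded; to predict on unseen data, fit a single
# model on all of the training data and use it.
clf.fit(X, y)
unseen_predictions = clf.predict(X_new)  # X_new: hypothetical unseen samples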
Bad Coder
  • If you're overfitting on your training data, your accuracy on test data will fall, and you may consider modifying your model parameters or the test/train split ratio. – AndyG Nov 05 '22 at 21:03
  • So, what should I do differently in the above code other than changing the test/train split ratio? – Bad Coder Nov 07 '22 at 19:58
  • I recommend you play with the parameters of your random forest, e.g. max_depth. You can even automate the search for the one that results in the highest accuracy on your test data (see the sketch after these comments). But first you should verify you really are overfitting. – AndyG Nov 08 '22 at 12:12
  • How can I verify overfitting? – Bad Coder Nov 09 '22 at 22:11
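
Following the comments, a minimal sketch of automating the parameter search with `GridSearchCV` on the training split; the grid values below are illustrative assumptions, not tuned recommendations:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative parameter grid (assumed values, not recommendations).
param_grid = {"max_depth": [3, 5, 10, None], "n_estimators": [100, 300]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation on the training data only
    scoring="accuracy",
)
search.fit(train_features, train_labels)

print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
# Final check once on the held-out test set.
print("test accuracy:", search.score(test_features, test_labels))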

0 Answers