
I created a classification model using Random Forest. To validate the model I am using K-Fold cross-validation with 10 splits and measuring model performance with the f1-score. When I do this, I get very low f1-scores for the first few folds and very high f1-scores for the rest of the folds.

I am expecting a similar range of scores in each split.

Code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

kf = KFold(n_splits=20, random_state=41)

f1list = []

for train_index, test_index in kf.split(XX):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = XX[train_index], XX[test_index]
    Y_train, Y_test = YY[train_index], YY[test_index]
    LR1 = RandomForestClassifier(n_estimators=10, criterion='entropy',
                                 random_state=1, max_depth=25, warm_start=True,
                                 bootstrap=True, oob_score=True, n_jobs=-1)

    model1 = LR1.fit(X_train, Y_train)
    pred1 = model1.predict(X_test)

    # f1_score expects (y_true, y_pred)
    f1list.append(f1_score(Y_test, pred1))

and the list of f1-scores for the 10 splits is:

[0.3659305993690852, 0.32, 0.3440860215053763, 0.3668639053254438, 0.4183381088825215, 0.9969525468001741, 0.9979652345793849, 0.9984892504357932, 0.9980234856412045, 0.9977904407489243]

1 Answer


The code looks correct to me, so the problem is probably in your data: results like this depend heavily on how the data is partitioned. You could try the following:

  1. Check that you have enough data for a 20-fold CV. Maybe you should consider fewer folds.
  2. Shuffle the data. It is good practice, as explained here.
  3. Repeat the CV several times. To get a single metric, average the f1-scores over the splits of each run, and then average those per-run averages (see the sketch after this list).
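
Here is a minimal sketch of points 2 and 3, assuming XX and YY are the feature matrix and label array from your question. RepeatedKFold shuffles the data before every repetition, and I've used 10 folds instead of 20 in case the dataset is on the small side:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Same main settings as in the question; warm_start and oob_score are
# dropped because cross_val_score refits a fresh clone for every split.
clf = RandomForestClassifier(n_estimators=10, criterion='entropy',
                             max_depth=25, random_state=1, n_jobs=-1)

# 10 shuffled folds, repeated 5 times -> 50 f1-scores in total.
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=41)
scores = cross_val_score(clf, XX, YY, cv=cv, scoring='f1')

# Average over the splits of each repetition, then over the repetitions.
per_repeat = np.reshape(scores, (5, 10)).mean(axis=1)
print("f1 per repetition:", per_repeat)
print("overall f1:", per_repeat.mean())

If the low scores in the first few folds disappear once shuffling is on, your data is most likely ordered (for example, by class), which is exactly the kind of thing an unshuffled KFold exposes.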

Let me know if it works!
