
I'm running cross-validation on logistic regression, and I've run into a strange issue: the train and test accuracies are 100% on every fold except the first and second, which are about 66%. 100% accuracy is definitely wrong; I expect accuracies in the 60s-70s range, so only the first two folds match my expectations.

I manually created train/val folds for cross-validation, and I used sklearn's logistic regression on each fold. I've checked and rechecked how I created the folds and the data, and everything seems to have been processed correctly. I also reinitialize the model before training/evaluating each fold, so the model can't be improving from fold to fold. The proportions of the positive and negative classes are what I expect in each fold. Even if train accuracy were high, I wouldn't expect test accuracy to be high as well. Does anyone know what might be happening here, or have suggestions on what I should look into?

Thanks!

import os

import pandas as pd
import scanpy as sc
from sklearn.linear_model import LogisticRegression

results = []
for i in range(len(TRAIN_FOLDS)):
    train_fp = os.path.join(TRAIN_DIR, TRAIN_FOLDS[i])
    val_fp = os.path.join(VAL_DIR, VAL_FOLDS[i])
    print("RUNNING:", train_fp)
    wes_data_train = sc.read_h5ad(train_fp)
    print("training data shape:", wes_data_train.X.shape)
    wes_data_val = sc.read_h5ad(val_fp)
    print("val data shape:", wes_data_val.X.shape)
    print("proportion responders:", (wes_data_val.obs['response'] == 1).sum() / len(wes_data_val.obs['response']))

    # fit and score the data
    lr = LogisticRegression()
    lr.fit(wes_data_train.X, wes_data_train.obs['response'])
    coeff_df = pd.DataFrame(lr.coef_, columns=wes_data_train.var.features)
    # print("coeffs:", lr.coef_)
    print("train acc:", lr.score(wes_data_train.X, wes_data_train.obs['response']))
    print("test acc:", lr.score(wes_data_val.X, wes_data_val.obs['response']))
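
For what it's worth, one extra sanity check I could run is verifying that no samples leak between a train fold and its validation fold (assuming `obs_names` hold unique sample IDs; the IDs below are just placeholders standing in for `wes_data_train.obs_names` / `wes_data_val.obs_names`):

    # Hypothetical sample IDs; in the real code these would come from
    # wes_data_train.obs_names and wes_data_val.obs_names.
    train_ids = ["s1", "s2", "s3", "s4"]
    val_ids = ["s5", "s6", "s3"]  # "s3" appears in both splits

    overlap = set(train_ids) & set(val_ids)
    print("overlapping samples:", overlap)  # any non-empty overlap means leakage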

Results for every fold:

1 Answer


This might not solve your question, but I just wanted to point out a few things. Generally, logistic regression doesn't give 100% accuracy even when you score it on the same data it was trained on.
It is also not recommended to reinitialize your model each time you evaluate on a particular fold, because you want to know how your model performs on each of the folds with the same weights, and this helps you decide which would be your final model.
If you reinitialize every time, there is no point in doing cross validation, as it uses different weights each time.
This link might help: Different results between cross_validate() and my own cross validation function
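
As a point of comparison, here is a minimal sketch of sklearn's built-in cross_validate on synthetic data (the `make_classification` call is just a stand-in for your real feature matrix and response labels):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_validate

    # Synthetic stand-in for the real X / y from the AnnData objects.
    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(LogisticRegression(max_iter=1000), X, y,
                            cv=cv, return_train_score=True)
    print("train acc per fold:", scores["train_score"])
    print("val acc per fold:  ", scores["test_score"])

If the built-in version produces sensible per-fold accuracies on your data while your manual loop does not, that points to the fold construction rather than the model.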

Nandan