I'm running cross-validation with logistic regression, and I've run into a strange issue: train and test accuracy are 100% on every fold except the first two, which sit at about 66%. 100% accuracy is definitely wrong; I expect accuracies more in the 60s-70s range, so only the first two folds match my expectations.
I manually created the train/val folds for cross-validation and used sklearn's LogisticRegression on each one. I've checked and rechecked how I created the folds and the data, and everything seems to have been processed correctly. I also reinitialize the model before training/evaluating each fold, so the model can't be improving across folds. The proportions of the positive and negative classes are what I expect in every fold. And even if train accuracy were high, I wouldn't expect test accuracy to be high as well. Does anyone know what might be happening here, or have suggestions on what I should look into?
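For what it's worth, here is roughly the overlap check I mean (a sketch, using the same fold variables as the code below; it assumes each AnnData's obs_names are unique per-sample IDs):

import os
import scanpy as sc

# Sketch: confirm no sample IDs are shared between a train fold and its val fold.
# Assumes obs_names hold unique per-sample identifiers.
for i in range(len(TRAIN_FOLDS)):
    train = sc.read_h5ad(os.path.join(TRAIN_DIR, TRAIN_FOLDS[i]))
    val = sc.read_h5ad(os.path.join(VAL_DIR, VAL_FOLDS[i]))
    overlap = set(train.obs_names) & set(val.obs_names)
    print(f"fold {i}: {len(overlap)} overlapping samples")  # expect 0 for every fold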
Thanks!
import os

import pandas as pd
import scanpy as sc
from sklearn.linear_model import LogisticRegression

results = []
for i in range(len(TRAIN_FOLDS)):
    # i = 2
    # load the pre-built train/val files for this fold
    train_fp = os.path.join(TRAIN_DIR, TRAIN_FOLDS[i])
    val_fp = os.path.join(VAL_DIR, VAL_FOLDS[i])
    print("RUNNING:", train_fp)
    wes_data_train = sc.read_h5ad(train_fp)
    print("training data shape:", wes_data_train.X.shape)
    wes_data_val = sc.read_h5ad(val_fp)
    print("val data shape:", wes_data_val.X.shape)
    print("proportion responders:", (wes_data_val.obs['response'] == 1).sum() / len(wes_data_val.obs['response']))

    # fit a fresh model on this fold, then score train and val accuracy
    lr = LogisticRegression()
    lr.fit(wes_data_train.X, wes_data_train.obs['response'])
    coeff_df = pd.DataFrame(lr.coef_, columns=wes_data_train.var.features)
    # print("coeffs:", lr.coef_)
    print("train acc:", lr.score(wes_data_train.X, wes_data_train.obs['response']))
    print("test acc:", lr.score(wes_data_val.X, wes_data_val.obs['response']))