I've been struggling for a while with this so I thought I would ask here.
I have a dataset with some missing values, so I want to use KNNImputer to fill them in. To test the validity of the features, for a range of k values I used a RandomForestClassifier to predict a binary target variable (Sampling Method). The data contains groupings (Sample Id), and the target variable is very imbalanced, so I've used StratifiedGroupKFold for cross-validation. I then want to fit the pipeline on x and y, using the k with the highest average classification accuracy across folds, so that I can get a complete dataset without missing values. With the code below, I get a classification accuracy of 90%.
What I want to understand is whether pipeline.fit currently takes the train/test splits of StratifiedGroupKFold into consideration, or whether the model is actually fitting on the entire dataset (and hence overfitting). I tried passing "groups" to pipeline.fit(x, y), but it raises an error.
Any help would be appreciated! Thanks
```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedGroupKFold, cross_val_score

# Features: drop the target and the grouping column
x = preproc_data.drop(['Sampling Method', 'Sample Id'], axis=1)
# Assign the 'Sampling Method' column from the preprocessed data to 'y'
y = preproc_data["Sampling Method"]
# Group labels for StratifiedGroupKFold
groups = preproc_data["Sample Id"]
# Initialize an empty list to store the scores for each k value
score_results = []
# Initialize an empty dictionary to store the average score for each k value
average_score_fold = {}
# Initialize a variable to store the maximum average score
maximum = 0
# Set a random state for reproducibility
rng = np.random.RandomState(0)
# Define a list of k values to iterate over
k_values = list(range(2, 201, 5))
# Iterate over each k value
for k in k_values:
    # Create the modeling pipeline: KNN imputation, then random-forest classification
    pipeline = Pipeline(steps=[('i', KNNImputer(n_neighbors=k)),
                               ('m', RandomForestClassifier(random_state=rng))])
    # Evaluate the model using StratifiedGroupKFold cross-validation
    cv = StratifiedGroupKFold(n_splits=10)
    scores = cross_val_score(pipeline, x, y, scoring="accuracy",
                             groups=groups, cv=cv, n_jobs=-1)
    # Store the average score for the current k value
    average_score_fold[k] = round(np.mean(scores), 3)
    # Append the per-fold scores to the score_results list
    score_results.append(scores)
    # If this k beats the best average score so far, refit and keep its imputation
    if average_score_fold[k] > maximum:
        maximum = average_score_fold[k]
        # Fit the pipeline on the entire dataset and transform the features using KNN imputation
        pipeline.fit(x, y)
        imputed_results = pipeline.named_steps['i'].transform(x)
        imputed_df = pd.DataFrame(imputed_results, columns=x.columns)
# The k value with the highest average score
k_highest_average_score = max(average_score_fold, key=average_score_fold.get)
```
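For context on the grouping side: my understanding is that StratifiedGroupKFold guarantees a group (here, a Sample Id) never appears in both the train and test side of the same fold. Below is a minimal pure-Python sketch of that group-exclusivity property — `group_folds` is my own toy round-robin assignment, not scikit-learn's actual stratified algorithm:

```python
from collections import defaultdict

def group_folds(groups, n_splits):
    """Toy stand-in for the group handling in StratifiedGroupKFold:
    assign each distinct group to exactly one fold, round-robin."""
    fold_of_group = {g: i % n_splits for i, g in enumerate(sorted(set(groups)))}
    folds = defaultdict(list)
    for idx, g in enumerate(groups):
        folds[fold_of_group[g]].append(idx)
    return [folds[f] for f in range(n_splits)]

groups = ["a", "a", "b", "b", "c", "c", "d", "d"]
for test_idx in group_folds(groups, n_splits=2):
    train_idx = [i for i in range(len(groups)) if i not in test_idx]
    # No group straddles the train/test boundary of any fold
    assert not ({groups[i] for i in test_idx} & {groups[i] for i in train_idx})
```

This is the property `cross_val_score(..., groups=groups, cv=cv)` uses during evaluation; it is separate from whatever the final `pipeline.fit(x, y)` call sees.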