
I've been struggling for a while with this so I thought I would ask here.

I have a dataset with some missing values, so I want to use KNNImputer to fill them in. To test the validity of the features, for a range of k values I used a RandomForestClassifier to predict a binary target variable (Sampling Method). The data contains groupings (Sample Id), and the target variable is very imbalanced, so I've used StratifiedGroupKFold for cross-validation. I then want to fit the pipeline on x and y so that I can get a complete dataset (without missing values), using the k with the highest average classification accuracy across folds. With the code below, I get a classification accuracy of 90%.

What I want to understand is whether pipeline.fit takes the train/test splits of StratifiedGroupKFold into account, or whether the model is actually fitting on the entire dataset (and hence overfitting). I tried passing `groups` to pipeline.fit(x, y), but it raises an error.

Any help would be appreciated! Thanks

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedGroupKFold, cross_val_score

# Separate the features from the target and grouping columns
x = preproc_data.drop(columns=['Sampling Method', 'Sample Id'])
# Assign the 'Sampling Method' column from the preprocessed data to 'y'
y = preproc_data["Sampling Method"]

groups = preproc_data["Sample Id"]

# Initialize an empty list to store the scores for each k value
score_results = []
# Initialize an empty dictionary to store the average score for each k value
average_score_fold = {}
# Initialize a variable to store the maximum average score
maximum = 0
# Set a random state for reproducibility
rng = np.random.RandomState(0)
# Define a list of k values to iterate over
k_values = list(range(2, 201, 5))

# Iterate over each k value
for k in k_values:
    # Create the modeling pipeline with KNN imputation and Random Forest classification
    pipeline = Pipeline(steps=[('i', KNNImputer(n_neighbors=k)), ('m', RandomForestClassifier(random_state=rng))])
    # Evaluate the model using StratifiedGroupKFold cross-validation
    cv = StratifiedGroupKFold(n_splits=10)
    scores = cross_val_score(pipeline, x, y, scoring="accuracy", groups=groups, cv=cv, n_jobs=-1)
    # Calculate the average score for the current k value and store it in the dictionary
    average_score_fold[k] = round(np.mean(scores), 3)
    # Append the scores to the score_results list
    score_results.append(scores)
    # If this k gives the best average score so far, keep its imputed dataset
    if average_score_fold[k] > maximum:
        maximum = average_score_fold[k]
        # Fit the pipeline on the entire dataset and transform the features using KNN imputation
        pipeline.fit(x, y)
        imputed_results = pipeline.named_steps['i'].transform(x)
        imputed_df = pd.DataFrame(imputed_results, columns=x.columns)
```