I would like to put the pipeline creation, KFold, and cross_val_score inside a for-loop, then iterate over a list of imputation strategies and a list of algorithms.
What I have so far:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/subashgandyer/datasets/main/heart_disease.csv")
X = df[df.columns[:-1]]
y = df[df.columns[-1]]
results =[]
strategies = ['mean', 'median', 'most_frequent','constant']
lr = LogisticRegression()
knn = KNeighborsClassifier()
rf = RandomForestClassifier()
svc = SVC()
models = [lr, knn, rf, svc]
for s in strategies:
    for m in models:
        pipeline = Pipeline([('impute', SimpleImputer(strategy=s)),('model',m)])
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    results.append(scores)
for a,b,c in zip(strategies,models,results):
    print(f"Strategy: {a} and model:{b} >> Accuracy: {round(np.mean(c),3)} | Max accuracy: {round(np.max(c), 3)}")
The code runs without errors, but I don't understand the following:
Why doesn't the loop iterate 16 times (mean & lr, mean & knn, ...)? Instead it runs 4 times, like mean & lr, median & knn.
To understand how the code iterates, I tried putting only a few models in the model list. I found that no matter how I change the model list, "results" is always a list of 4 arrays, each with 30 elements. Each set of 30 elements is then the input to np.mean or np.max. So how should I modify my for-loops to make this work?
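To narrow the problem down, here is a minimal sketch of the pairing logic on its own, without scikit-learn. The names `model_names`, `zipped_pairs`, and `all_pairs` are illustrative and not part of the code above, and the `None` values are placeholders for the real score arrays. It suggests that the 4 printed lines come from `zip`, which pairs items by position and stops at the shortest input, whereas two nested loops (or `itertools.product`) visit all 16 combinations.

```python
from itertools import product

strategies = ['mean', 'median', 'most_frequent', 'constant']
# Plain strings stand in for the estimator objects (assumption for illustration).
model_names = ['lr', 'knn', 'rf', 'svc']

# zip pairs items by position and stops at the shortest input: 4 pairs.
zipped_pairs = list(zip(strategies, model_names))

# product yields every combination, like two nested for-loops: 16 pairs.
all_pairs = list(product(strategies, model_names))

# Storing each result under its (strategy, model) key inside the inner loop
# keeps labels and scores aligned, so no zip is needed when printing.
results = {}
for s in strategies:
    for m in model_names:
        # Placeholder for cross_val_score(pipeline, X, y, ...) in the real code.
        results[(s, m)] = None

print(len(zipped_pairs), len(all_pairs), len(results))
```

If "results" really holds only 4 arrays, the append is probably running once per strategy rather than once per (strategy, model) pair, which would point to the indentation of the lines after the inner loop. The 30 elements per array match n_splits=10 times n_repeats=3, since cross_val_score returns one score per fold.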