
I would like to put the pipeline creation, the KFold setup, and the cross_val_score call inside a for loop, and then iterate over a list of imputation strategies and a list of algorithms.

Here is what I have so far:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

from sklearn.impute import SimpleImputer

import pandas as pd
import numpy as np

df = pd.read_csv("https://raw.githubusercontent.com/subashgandyer/datasets/main/heart_disease.csv")
X = df[df.columns[:-1]]
y = df[df.columns[-1]]

results = []

strategies = ['mean', 'median', 'most_frequent', 'constant']

lr = LogisticRegression()
knn = KNeighborsClassifier()
rf = RandomForestClassifier()
svc = SVC()
models = [lr, knn, rf, svc]

for s in strategies:
    for m in models:
        pipeline = Pipeline([('impute', SimpleImputer(strategy=s)),('model',m)])
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    
    results.append(scores)
    
for a,b,c in zip(strategies,models,results):
    print(f"Strategy: {a} and model:{b} >> Accuracy: {round(np.mean(c),3)}   |   Max accuracy: {round(np.max(c), 3)}")

The code runs without errors, but I have two problems:

  1. Why doesn't the loop produce 16 results (mean & lr, mean & knn, ...)? Instead I only get 4, paired like mean & lr, median & knn.

  2. To understand how the code iterates, I tried putting only a few models in the model list. No matter how I change the model list, `results` is always a list of 4 arrays with 30 elements each, and each array of 30 is what np.mean and np.max are then computed on. How should I modify my for loops to make this work?
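For what it's worth, the control flow can be reproduced without scikit-learn. The sketch below is only an illustration: `fake_score` is a hypothetical stand-in for `cross_val_score`, and the models are plain strings rather than estimators. It keeps the same indentation and the same `zip` call as the code above, so it shows how many pairs are evaluated versus how many end up stored.

```python
strategies = ['mean', 'median', 'most_frequent', 'constant']
models = ['lr', 'knn', 'rf', 'svc']


def fake_score(strategy, model):
    # Hypothetical stand-in for cross_val_score: it only records which pair ran.
    return (strategy, model)


evaluated = []  # every pair the inner loop actually runs
results = []    # what the indentation from the question ends up storing

for s in strategies:
    for m in models:
        scores = fake_score(s, m)
        evaluated.append(scores)
    # Same indentation as in the question: this runs once per strategy,
    # so only the last model's scores for each strategy are kept.
    results.append(scores)

print(len(evaluated))  # 16
print(len(results))    # 4
print(results)

# zip pairs items position by position and stops at the shortest input,
# so it yields 4 pairs rather than all 16 combinations.
paired = list(zip(strategies, models))
print(paired)
```

Under these assumptions, all 16 pairs are evaluated, but `results` holds only the last model's result for each strategy, and `zip` then labels those 4 entries with position-matched pairs such as mean & lr and median & knn.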

Alexander L. Hayes
resssslll
  • `results.append(scores)` should be indented to the level of the prior block – Nick Sep 26 '22 at 08:23
  • I see two non-inner loops: which is "the whole loop"? You may be looking for [`for (a, b), c in zip(itertools.product(strategies, models), results)`](https://docs.python.org/3/library/itertools.html#itertools.product), but: Why collect scores in `results` to just print accuracies one by one? – greybeard Sep 26 '22 at 10:20
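
A minimal sketch of what the two comments suggest, not a definitive fix: the scores are stored inside the inner loop, and `itertools.product` walks every (strategy, model) pair. Several things here are assumptions made so the example runs on its own: the data is synthetic (`make_classification` with about 5% of values set to NaN) rather than heart_disease.csv, the models live in a dict with made-up short names, `n_splits=5` and `n_repeats=2` replace the question's 10 and 3 to keep it quick, and `n_jobs` is left at its default.

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumption: synthetic data with roughly 5% missing values stands in for the CSV.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.05] = np.nan

strategies = ['mean', 'median', 'most_frequent', 'constant']
# Assumption: short names as dict keys, so results can be labeled without zip.
models = {
    'lr': LogisticRegression(max_iter=1000),
    'knn': KNeighborsClassifier(),
    'rf': RandomForestClassifier(n_estimators=25, random_state=1),
    'svc': SVC(),
}

# Smaller than the question's 10 splits x 3 repeats: 5 x 2 = 10 scores per pair.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)

results = {}
for s, (name, m) in product(strategies, models.items()):
    pipeline = Pipeline([('impute', SimpleImputer(strategy=s)), ('model', m)])
    # Stored once per (strategy, model) pair, so nothing is overwritten.
    results[(s, name)] = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)

for (s, name), scores in results.items():
    print(f"Strategy: {s} and model: {name} >> "
          f"Accuracy: {np.mean(scores):.3f} | Max accuracy: {np.max(scores):.3f}")
```

Keying `results` by the (strategy, model name) pair keeps each score array attached to the combination that produced it, so the printing loop does not need `zip` at all. With the original data and settings, each of the 16 entries would hold 30 scores (10 splits times 3 repeats).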

0 Answers