I would like to generate multiple test splits using stratified k-fold (StratifiedKFold, skf) and then generate and assemble predictions for each of these test splits (and hence for all of the data) using a scikit-learn model. I am at my wits' end as to how to do this programmatically.

I have reproduced my code with a minimal data example below. Briefly, after importing the data, I define a function that fits the model and returns the predicted probabilities. I then try to apply this function to each skf split of my data, so as to generate and collate predicted probabilities for every row. However, this step fails with a ValueError (Boolean array expected). My code follows:

import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

    #load data, assemble dataframe
iris = datasets.load_iris()
X = pd.DataFrame(iris.data[51:150, :], columns = ["sepal_length", "sepal_width", 
"petal_length", "petal_width"])
y = pd.DataFrame(iris.target[51:150,], columns = ["target"])
df = pd.concat([X,y], axis = 1)

    #instantiate logistic regression
log = LogisticRegression()

    #modelling function
def train_model(train, test, fold):
    X = df.drop("target", axis = 1)
    y = df["target"]

    X_train = train[X]
    y_train = train[y]
    X_test = test[X]
    y_test = test[y]

        #generate probability of class 1 predictions from logistic regression model fit
    prob = log.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    return (prob)

    #generate stratified k-fold splits (2 used as example here)
skf = StratifiedKFold(n_splits = 2)

   #generate and collate all predictions (for each row in df)
fold = 1
outputs = []
for train_index, test_index in skf.split(df, y):
    train_df = df.loc[train_index,:]
    test_df = df.loc[test_index,:]
    output = train_model(train_df,test_df,fold) #generate model probabilities for X_test in skf split
    outputs.append(output) #append all model probabilities 
    fold = fold + 1  

all_preds = pd.concat(outputs)

Can somebody please guide me to a solution that returns each row's index together with its predicted probability?

veg2020
  • Please provide the error traceback; it makes it much easier to help you. – Alex Serra Marrugat Jun 29 '22 at 14:24
  • Thanks, yes. The ValueError is raised at this step: ```output = train_model(train_df,test_df,fold)```. The message is: ```ValueError: Boolean array expected for the condition, not float64``` – veg2020 Jun 29 '22 at 14:29
  • I figured out the answer, you can do this via cross_val_predict. Will post the solution later to help any others who may find it useful. – veg2020 Jul 23 '22 at 00:40
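
For anyone landing here, a minimal sketch of the cross_val_predict approach mentioned in the last comment, using the same iris subset as the question. As far as I can tell, the ValueError comes from `train[X]`: at that point `X` is a DataFrame of floats, and pandas treats a DataFrame used as a key as a boolean mask. The sketch avoids this by slicing rows positionally with `.iloc`. The names `feature_cols`, `result` and `manual`, and the `max_iter=1000` setting, are my own assumptions for illustration and are not from the original post.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Same subset as in the question: iris rows 51..149 (only labels 1 and 2 remain).
iris = datasets.load_iris()
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X = pd.DataFrame(iris.data[51:150, :], columns=feature_cols)
y = pd.Series(iris.target[51:150], name="target")

skf = StratifiedKFold(n_splits=2)
model = LogisticRegression(max_iter=1000)  # max_iter is an assumption, raised as a precaution

# Every row is predicted exactly once, by a model fitted on the other fold(s).
proba = cross_val_predict(model, X, y, cv=skf, method="predict_proba")

# Column 1 is the second of the sorted class labels (label 2 in this subset).
all_preds = pd.Series(proba[:, 1], index=X.index, name="prob")
result = pd.concat([y, all_preds], axis=1)  # row index, true label, probability
print(result.head())

# Equivalent manual loop, closer to the structure of the code in the question.
manual = pd.Series(index=X.index, dtype=float, name="prob_manual")
for train_idx, test_idx in skf.split(X, y):
    # skf.split yields positional indices, so use .iloc to select rows.
    fitted = LogisticRegression(max_iter=1000).fit(X.iloc[train_idx], y.iloc[train_idx])
    manual.iloc[test_idx] = fitted.predict_proba(X.iloc[test_idx])[:, 1]
```

Both variants keep the DataFrame's row index, so each probability can be traced back to its row. The manual loop is only worth keeping if you need per-fold access to the fitted models; otherwise cross_val_predict does the bookkeeping for you.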

0 Answers