I would like to generate multiple test data splits using stratified KFold (skf) and then generate/assemble predictions for each of these test data splits (and hence all of the data) using a sklearn model. I am at a wits end on how to do this programmatically.
I have recaptured my code using a minimal data example below. Briefly, (after data import), I have a function that does the model fit and generates model predicted probabilities. Subsequently, I attempt to pass this function to each skf split of my data so as to generate and subsequently collate predicted probabilities for each row of my data. However, this step fails and generates a valueerror (boolean array expected). My code follows below:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
#load data, assemble dataframe
iris = datasets.load_iris()
X = pd.DataFrame(iris.data[51:150, :], columns = ["sepal_length", "sepal_width",
"petal_length", "petal_width"])
y = pd.DataFrame(iris.target[51:150,], columns = ["target"])
df = pd.concat([X,y], axis = 1)
#instantiate logistic regression
log = LogisticRegression()
#modelling function
def train_model(train, test, fold):
X = df.drop("target", axis = 1)
y = df["target"]
X_train = train[X]
y_train = train[y]
X_test = test[X]
y_test = test[y]
#generate probability of class 1 predictions from logistic regression model fit
prob = log.fit(X_train, y_train).predict_proba(X_test)[:, 1]
return (prob)
#generate straified k-fold splits (2 used as example here)
skf = StratifiedKFold(n_splits = 2)
#generate and collate all predictions (for each row in df)
fold = 1
outputs = []
for train_index, test_index in skf.split(df, y):
train_df = df.loc[train_index,:]
test_df = df.loc[test_index,:]
output = train_model(train_df,test_df,fold) #generate model probabilities for X_test
in skf split
outputs.append(output) #append all model probabilities
fold = fold + 1
all_preds = pd.concat(outputs)
Can somebody please guide me to the solution that includes row index and its predicted probability?