Evaluating logistic regression using cross validation and ROC

Question

I am trying to evaluate logistic regression using the AUROC curve and cross-validate my scores. When I don't cross-validate I have no issues, but I really want to use cross-validation to help decrease bias in my method.

Anyway, below is the first part of my code, along with the error I get:

X = df.drop('Survived', axis=1)
y = df['Survived']

skf = StratifiedKFold(n_splits=5)
logmodel = LogisticRegression()

i=0
for train, test in skf.split(X,y):
    logmodel.fit(X[train], y[train])   # error occurs here
    predictions = logmodel.predict_proba(X[test])
    # a bunch of code that I haven't included which creates the ROC curve
    i += 1

The error occurs on the line that calls logmodel.fit, and the message is a list of integers followed by 'not in index'.

I don't really understand what the problem is?

This is my understanding of the code: First I create an instance of both stratified k-fold and logistic regression. The stratified k-fold instance specifies that five folds are to be made. Next, for each train and test fold in my dataset X, y, I fit the logistic model to the training data and then compute predicted probabilities for the test data. Later (this part is not shown) I will create a ROC curve for each fold.

Again, I don't really understand what the problem is but maybe somebody can clarify. My work is more or less copied directly from this link in sklearn: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py

Please see the guidelines on [how to ask](https://stackoverflow.com/help/how-to-ask) a question. _Please attach the exact error line so we can see the error_. Also, your title is misleading: you simply have an error when fitting the logistic model, not in the validation step (which is also not included here), and you only present what you already know without asking an actual question. Do you expect us to verify your understanding? — mr_mo, Nov 17 '18 at 23:37
X, y are pandas objects. You need to use `iloc` to access the elements. See [sklearn TimeSeriesSplit Error: KeyError: '\[ 0 1 2 ...\] not in index'](https://stackoverflow.com/questions/51597507/sklearn-timeseriessplit-error-keyerror-0-1-2-not-in-index). Or else you can first convert the pandas objects into numpy array as the answer by @mr_mo suggests. — Vivek Kumar, Nov 19 '18 at 07:22

score 0 · Answer 1 · answered Nov 17 '18 at 23:48

Please add more details so it can be truly examined. Preferably (and actually required) a piece of code that one can run to see the error.

At first glance, you take a pandas DataFrame and index it with the fold indices as if it were a NumPy array, which is incorrect. The following lines retrieve the data in a form that works with the rest of your code:

X = df.drop('Survived', axis=1).values
y = df['Survived'].values

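The .values attribute returns the NumPy array that backs the DataFrame or Series. NumPy arrays are indexed by integer position, which is what the train and test index arrays from skf.split expect, so this is consistent with the rest of the code.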
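
For reference, here is a minimal sketch of the full loop with both fixes side by side: positional indexing with iloc (as suggested in the comments) and conversion to NumPy arrays. The DataFrame is synthetic, and the Age and Fare columns and the way Survived is generated are assumptions made only so the example runs on its own. The per-fold AUC stands in for the ROC plotting code that the question omits.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the asker's data (column names are hypothetical).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Age": rng.normal(30, 10, n),
    "Fare": rng.normal(50, 20, n),
})
df["Survived"] = (df["Fare"] + rng.normal(0, 20, n) > 50).astype(int)

X = df.drop("Survived", axis=1)
y = df["Survived"]

skf = StratifiedKFold(n_splits=5)
logmodel = LogisticRegression(max_iter=1000)

# Option 1: keep the pandas objects and index by position with iloc.
# skf.split yields arrays of integer positions, not index labels.
fold_aucs = []
for train, test in skf.split(X, y):
    logmodel.fit(X.iloc[train], y.iloc[train])
    # Column 1 holds the predicted probability of the positive class.
    proba = logmodel.predict_proba(X.iloc[test])[:, 1]
    fold_aucs.append(roc_auc_score(y.iloc[test], proba))

# Option 2: convert to NumPy arrays once, then index directly.
X_arr, y_arr = X.values, y.values
fold_aucs_np = []
for train, test in skf.split(X_arr, y_arr):
    logmodel.fit(X_arr[train], y_arr[train])
    proba = logmodel.predict_proba(X_arr[test])[:, 1]
    fold_aucs_np.append(roc_auc_score(y_arr[test], proba))

print(fold_aucs)
print(fold_aucs_np)
```

Both options select the same rows in each fold, so they should give the same per-fold scores. The iloc version keeps the column names attached to the data, which can be convenient when inspecting the model later.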

Hopefully that helps you to solve the error.

Good luck!
