SUMMARY

When feeding test and train data into a ROC curve plot, I receive the following error:

KeyError: "None of [Int64Index([ 0, 1, 2, ... dtype='int64', length=1323)] are in the [columns]"

The error seems to be saying that it doesn't like the format of my data, but it worked when run the first time and I haven't been able to get it to run again.

Am I incorrectly splitting my data or sending incorrectly formatted data into my function?

WHAT I'VE TRIED

  • Read through several StackOverflow posts with the same KeyError
  • Re-read through the scikit-learn example I followed
  • Reviewed previous versions of my code to troubleshoot

I am running this in a Colab notebook, and it can be viewed here

CODE

I am using standard dataframes to pull in my X and Y sets:

X = df_full.drop(['Attrition'], axis=1)
Y = df_full['Attrition'].as_matrix()

The KeyError traces back to the 8th line here:

def roc_plot(X, Y, model):
    tprs = []
    aucs = []
    mean_fpr = np.linspace(0, 1, 100)
    plt.figure(figsize=(12,8))
    i = 0
    for train, test in kf.split(X, Y):
        probas_ = model.fit(X[train], Y[train]).predict_proba(X[test])
        # Compute ROC curve and area under the curve
        fpr, tpr, thresholds = roc_curve(Y[test], probas_[:, 1])
        tprs.append(np.interp(mean_fpr, fpr, tpr))
        tprs[-1][0] = 0.0
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)
        plt.plot(fpr, tpr, lw=1, alpha=0.3,
                 label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))

        i += 1
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
             label='Chance', alpha=.8)

    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)
    plt.plot(mean_fpr, mean_tpr, color='b',
             label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
             lw=2, alpha=.8)

    std_tpr = np.std(tprs, axis=0)
    tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
    tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
    plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                     label=r'$\pm$ 1 std. dev.')

    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

It happens when I run the following with the function:

model = XGBClassifier() # Create the Model
roc_plot(X, Y, model)

EXPECTED RESULT

I should be able to feed the data, X and Y, into my function.

realr
OfSorts
  • Hello, Mary. I think the error arises because you are using data frames where it is better to use NumPy arrays. If you look at the traceback, you can see that the error is raised in the line `probas_ = model.fit(X[train], Y[train]).predict_proba(X[test])`, but `X` is a data frame, as I can see from your code. So try the following: replace the lines `X = df_full.drop(['Attrition'], axis=1)` and `Y = df_full['Attrition'].as_matrix()` with `X = df_full.drop(['Attrition'], axis=1).values` and `Y = df_full['Attrition'].values`. It is better (and more reliable) to work with NumPy arrays when training models. – bubble Apr 27 '19 at 02:57
  • The error is not happening in the model but in the DataFrame indexing, as shown one step further in the stack trace (`__getitem__`). I couldn't run the code here, but for further debugging I suggest isolating the `kf.split` part, taking a look at `X` and `Y`, and testing which of `X[train]` or its equivalents is failing. Best of luck! – araraonline Apr 27 '19 at 05:09
  • As a guess, it looks like `X[train]` is trying to select columns of `X`, when you actually want to select rows. If that's the case, replacing `X[train]` and its equivalents with `X.loc[train]` should work. – araraonline Apr 27 '19 at 05:22
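
The behaviour described in these comments can be checked on a toy frame. The following is a minimal sketch, assuming a small made-up DataFrame (`toy`, with hypothetical columns `a` and `b`) in place of `df_full`. It reproduces the KeyError and tries the suggested workarounds side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the feature frame X (not the asker's data)
toy = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1.0, 2.0, 3.0, 4.0]})
train = np.array([0, 2])  # positional row indices, like those kf.split yields

# toy[train] treats the integers as column labels, so it raises KeyError
try:
    toy[train]
    raised = False
except KeyError:
    raised = True

# Workaround 1: convert to a NumPy array, where [] indexes rows
rows_from_array = toy.values[train]

# Workaround 2: keep the DataFrame and select rows by position
rows_from_iloc = toy.iloc[train]

# .loc selects by index label; it only matches .iloc here because
# the toy frame has the default 0..n-1 index
rows_from_loc = toy.loc[train]
```

Note that `.loc` and `.iloc` agree only while the index is the default range. After filtering or shuffling rows, the labels no longer match the positions, and `.iloc` is the safer choice for indices produced by a splitter.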

1 Answer

In this piece of code, `train` and `test` are arrays of row indices, but you are using them as column labels when selecting from the DataFrame:

for train, test in kf.split(X, Y):
    probas_ = model.fit(X[train], Y[train]).predict_proba(X[test])

You should use `iloc` instead:

    # X is a DataFrame, so select rows with iloc; Y from as_matrix() is a NumPy array, so Y[train] works
    probas_ = model.fit(X.iloc[train], Y[train]).predict_proba(X.iloc[test])
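
To put the fix in context, the fold loop might look like the following. This is a sketch under assumptions, not the asker's setup: synthetic data from `make_classification` replaces `df_full`, `LogisticRegression` stands in for `XGBClassifier`, and `StratifiedKFold(n_splits=5)` is a guess at `kf`, which the question never defines. Here `Y` is kept as a pandas Series, so `.iloc` applies to it too; if `Y` is already a NumPy array (as with `as_matrix()` in the question), plain `Y[train]` is the right form.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption; not the asker's dataset)
features, labels = make_classification(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(features, columns=[f"f{i}" for i in range(5)])
Y = pd.Series(labels, name="Attrition")

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # assumed splitter
model = LogisticRegression(max_iter=1000)  # stand-in for XGBClassifier

aucs = []
for train, test in kf.split(X, Y):
    # train/test are positional indices, so use iloc on pandas objects
    probas_ = model.fit(X.iloc[train], Y.iloc[train]).predict_proba(X.iloc[test])
    fpr, tpr, _ = roc_curve(Y.iloc[test], probas_[:, 1])
    aucs.append(auc(fpr, tpr))

mean_auc = float(np.mean(aucs))
```

The plotting calls from the question can be added back inside and after the loop unchanged, since they only consume `fpr`, `tpr`, and `aucs`.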
sashaostr