My k-fold cross validation technique is giving error on my dataframe with deleted rows

Question

I have been working with a dataframe and had to remove the rows that contained any null values. I used the following command to delete them:

df.dropna(axis=0,how="any",inplace=True)

Then when I apply k-fold cross validation like this:

#Using kfold cross validation
from sklearn.model_selection import KFold, cross_val_predict
kf = KFold(shuffle=True, random_state=42, n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (X.iloc[train_index, :], 
                                        X.iloc[test_index, :], 
                                        y[train_index], 
                                        y[test_index])

I face the following error:

KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Int64Index([    0,   149,   151,   156,   157,\n            ...\n            26474, 26987, 27075, 27157, 27345],\n           dtype='int64', length=1764). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"

I do not know how to fix this. It's probably giving me an error because those rows no longer exist, and I probably have to reindex the dataframe so that the index starts from zero again and has no gaps. I do not know how to do that. Can anyone suggest a fix? Thanks

"I do not know how to fix this" - have you visited https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike as the error message suggests? — ForceBru, May 23 '22 at 12:31
Yes, I have, and I tried to use reindex(), but it's still giving me an error. – Sam, May 23 '22 at 12:39
@Sam are X and y slices of the same dataframe? Which dataframe have you applied the dropna function to? — Ishan Manchanda, May 23 '22 at 12:40
@IshanManchanda I applied dropna to the dataframe named df, and X and y are slices of that dataframe. – Sam, May 23 '22 at 13:25

score 0 · Answer 1 · answered May 23 '22 at 19:15

What I think you want is:

for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (X.iloc[train_index],
                                        X.iloc[test_index],
                                        y.iloc[train_index],
                                        y.iloc[test_index])

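I think your problem comes from the fact that you are using the positional indices generated by kf.split(X) as index labels in y[train_index] and y[test_index]. After dropna, the labels of the deleted rows no longer exist in y, so the label lookup fails. Your original could, by chance, work if the index labels of X and y happened to coincide with their positions.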
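
To make the difference concrete, here is a minimal sketch on made-up data. The toy dataframe, its column names (a, b, target) and the choice of 3 splits are assumptions for illustration only, not your actual data. It shows that dropna leaves gaps in the index labels, and that using iloc for both X and y keeps the folds aligned. As a possible alternative, resetting the index right after dropna should make labels and positions coincide again.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data standing in for the asker's df
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0],
    "b": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Rows 1 and 4 are dropped, so the remaining index labels are 0, 2, 3, 5, 6, 7
df.dropna(axis=0, how="any", inplace=True)

X = df[["a", "b"]]
y = df["target"]

kf = KFold(n_splits=3, shuffle=True, random_state=42)
fold_sizes = []
for train_index, test_index in kf.split(X):
    # kf.split yields positions (0..n-1), so select by position on both X and y
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    fold_sizes.append((len(X_train), len(X_test)))

# Alternative: renumber the rows 0..n-1 so labels match positions again
df_reset = df.reset_index(drop=True)
```

If you go the reset_index route, X and y need to be sliced from the dataframe after the reset; using iloc everywhere works regardless of what the index looks like.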

jch

You mean that instead of using y.iloc[train_index] and y.iloc[test_index] I should use y_train and y_test? – Sam May 24 '22 at 20:03
I'm saying that you should try what I have above. Using `iloc[]` across the board. It worked with my own generated data. – jch May 24 '22 at 21:08