How to split train data and validation data properly in K fold cross validation

Question

I am not a native English speaker and am using a translator, so I apologize if anything here is awkward or hard to read.

I am trying to train a model with K-fold cross validation, but I keep getting errors when I split the training data into folds. The following code sets up my dataset.

df_test = df_data.iloc[50001:, :] #Test set
df_use = df_data.iloc[0:50000, :] #Training set
    
x_test = df_test.drop(['upgraded'], axis = 1)
y_test = df_test['upgraded']
    
x = df_use.drop(['upgraded'], axis = 1)
y = df_use['upgraded']

Every time I try to split the data into training and validation sets, an error occurs.

for train_ix, val_ix in kfold.split(x):

    trainX, trainy = x[train_ix], y[train_ix]
    valX, valy = x[val_ix], y[val_ix]


    model, val_acc = evaluate_model(trainX, trainy, valX, valy)

I'm not sure this will help, but the line trainX, trainy = x[train_ix], y[train_ix] raises this error:

KeyError: "None of [Int64Index([10000, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008,\n 10009,\n ...\n 49990, 49991, 49992, 49993, 49994, 49995, 49996, 49997, 49998,\n 49999],\n dtype='int64', length=40000)] are in the [columns]"

So I changed the code as follows.

for train_ix, val_ix in kfold.split(x):

    trainX, valX = x.iloc[train_ix], x.iloc[val_ix]
    trainy, valy = y.iloc[train_ix], y.iloc[val_ix]

    model, val_acc = evaluate_model(trainX, trainy, valX, valy)

This time, the line model, val_acc = evaluate_model(trainX, trainy, valX, valy) raises this error:

IndexError: index -9223372036854775808 is out of bounds for axis 1 with size 2

I also tried the following, after splitting df_use with train_test_split. The same IndexError occurs.

inputs = np.concatenate((x_train, x_val), axis=0)
targets = np.concatenate((y_train, y_val), axis=0)

I want to split the data correctly so that I can run K-fold cross validation on my model. Any help would be appreciated.

Kedar U Shet · Answer 1 · 2021-12-12T10:09:29.317

You can try the following:

from sklearn.model_selection import KFold

# Start the test set at row 50000 so that no row is skipped between the two sets
df_test = df_data.iloc[50000:, :]  # Test set
df_use = df_data.iloc[0:50000, :]  # Training set

y_test = df_test['upgraded']
x_test = df_test.drop(['upgraded'], axis=1)

y = df_use['upgraded']
x = df_use.drop(['upgraded'], axis=1)

kf = KFold(n_splits=2)

for train_index, test_index in kf.split(x):
    # kf.split yields integer positions, so select rows by position with take
    trainX, valX = x.take(train_index, axis=0), x.take(test_index, axis=0)
    trainy, valy = y.take(train_index, axis=0), y.take(test_index, axis=0)
    # Evaluate inside the loop so that every fold is used, not only the last one
    model, val_acc = evaluate_model(trainX, trainy, valX, valy)

I hope this works. Please comment below if you run into any issues.
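For reference, the KeyError in the question arises because `x[train_ix]` on a DataFrame looks the integers up as column labels, while `kfold.split` returns row positions. The following is a minimal sketch of positional selection with `.iloc`. The small DataFrame, its `f1` and `f2` columns, and the index starting at 100 are made up for illustration; only the `upgraded` column name comes from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic stand-in for df_use (assumption): 10 rows, two features, a binary label.
# The index starts at 100 to show that row labels and row positions can differ.
df_use = pd.DataFrame(
    {
        "f1": np.arange(10, dtype=float),
        "f2": np.arange(10, 20, dtype=float),
        "upgraded": [0, 1] * 5,
    },
    index=np.arange(100, 110),
)
x = df_use.drop(columns=["upgraded"])
y = df_use["upgraded"]

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
for train_ix, val_ix in kfold.split(x):
    # kfold.split yields integer positions, so select rows with .iloc.
    # x[train_ix] would instead treat the integers as column labels (KeyError).
    trainX, valX = x.iloc[train_ix], x.iloc[val_ix]
    trainy, valy = y.iloc[train_ix], y.iloc[val_ix]
    fold_sizes.append((len(trainX), len(valX)))
    # evaluate_model(trainX, trainy, valX, valy) would be called here, inside the loop.

print(fold_sizes)
```

If `evaluate_model` expects NumPy arrays rather than pandas objects, passing `trainX.to_numpy()` and `trainy.to_numpy()` is one option; whether that is needed depends on how `evaluate_model` is written, which the thread does not show.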

edited Dec 12 '21 at 10:09

answered Dec 12 '21 at 09:15


Hello, and thanks for your answer. I tried your code, but unfortunately it raised the same KeyError that I mentioned in my question. I had to change every `x[train_index]` to `x.iloc[train_index]`. Even so, the same IndexError still occurs at this line: `model, val_acc = evaluate_model(trainX, trainy, valX, valy)` – Margaret Stark Dec 12 '21 at 09:51
Hey, I have changed the part where the error occurred, please try again. – Kedar U Shet Dec 12 '21 at 10:10
Oh, I'm sorry. It seems the earlier error occurred because I ran the modified code without restarting the runtime. After restarting the runtime, I tried the code you gave me, and it still raised the same error, **IndexError: index -9223372036854775808 is out of bounds for axis 1 with size 2**, at `model, val_acc = evaluate_model(trainX, trainy, valX, valy)`. In case it helps, my data has 12 explanatory variables, 1 dependent variable, and 75000 rows. – Margaret Stark Dec 12 '21 at 10:52
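
The index in the error message, -9223372036854775808, is the smallest 64-bit signed integer. One way such a value can appear (an assumption here, not something the thread confirms) is when a missing value in the label column is cast to an integer and then used as an index into an array with 2 columns, for example during one-hot encoding. The sketch below checks a label Series for missing or out-of-range values before it is passed to `evaluate_model`. The helper `label_report` and the demo labels are hypothetical, and the check assumes numeric labels.

```python
import numpy as np
import pandas as pd

def label_report(y, n_classes=2):
    """Summarize problems in a numeric label Series that could break integer indexing."""
    y = pd.Series(y)
    n_missing = int(y.isna().sum())
    valid = y.dropna()
    # Labels outside 0..n_classes-1 would also index out of bounds.
    out_of_range = sorted(set(valid[(valid < 0) | (valid >= n_classes)].tolist()))
    return {"n_missing": n_missing, "out_of_range": out_of_range, "dtype": str(y.dtype)}

# Hypothetical labels with one missing value and one unexpected class.
y_demo = pd.Series([0, 1, np.nan, 1, 2])
report = label_report(y_demo)
print(report)
```

Running the same check on `y` from the question (`label_report(y)`) would show whether `upgraded` contains missing values or classes other than 0 and 1. If it does, dropping or imputing those rows before splitting is one possible remedy.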
