
I'm trying to create a KFold object to use with xgboost.cv, and I have:

import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10]])

KF = KFold(n_splits=2)
kf = KF.split(df)

But it seems I can only enumerate once:

for i, (train_index, test_index) in enumerate(kf):
    print(f"Fold {i}")

for i, (train_index, test_index) in enumerate(kf):
    print(f"Again_Fold {i}")

gives output of

Fold 0
Fold 1

The second enumerate seems to be on an empty object.

I am probably misunderstanding something fundamental, or have completely messed up somewhere, but could someone explain this behavior?

[Edit, adding a follow-up question] This behavior seems to be why passing the splits to xgboost.cv as xgboost.cv(..., folds = KF.split(df)) raises an index out of range error. My fix is to rebuild the list of tuples with

kf = []
for i, (train_index, test_index) in enumerate(KF.split(df)):
    this_split = (list(train_index), list(test_index))
    kf.append(this_split)

xgboost.cv(..., folds = kf)

I'm looking for a smarter solution.

Yue Y
  • The [documentation for KFold.split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold.split) indicates it yields rather than returns, which would suggest that it's a generator rather than an iterable. If memory serves me correctly generators can only be iterated through once. – Shorn Feb 23 '23 at 01:44
  • As @Shorn is mentioning, kf is a generator and you cannot iterate over a generator twice. See [here](https://stackoverflow.com/questions/45400013/can-generator-be-used-more-than-once) – Mattravel Feb 23 '23 at 03:06
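The one-pass behavior described in these comments can be reproduced without scikit-learn. The sketch below uses `toy_split`, a hypothetical stand-in written only for illustration (it is not the real `KFold.split`), to show that a generator is empty on a second pass, while a list built from it can be reused.

```python
def toy_split(n_samples, n_splits):
    """Hypothetical stand-in for KFold.split: yields (train, test) index lists."""
    fold_size = n_samples // n_splits
    for k in range(n_splits):
        test = list(range(k * fold_size, (k + 1) * fold_size))
        train = [i for i in range(n_samples) if i not in test]
        yield train, test


gen = toy_split(4, 2)
first_pass = [test for _, test in gen]   # consumes the generator
second_pass = [test for _, test in gen]  # nothing is left to yield

folds = list(toy_split(4, 2))            # materialise once, reuse freely
reuse_a = [test for _, test in folds]
reuse_b = [test for _, test in folds]

print(first_pass, second_pass, reuse_a == reuse_b)
```

With the real splitter, the same idea would presumably be `list(KF.split(df))`, or simply calling `KF.split(df)` again each time a fresh pass is needed.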

1 Answer


Using an example:

from sklearn.model_selection import KFold
import xgboost as xgb
import numpy as np

data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
label = np.random.randint(2, size=5)  # binary target
dtrain = xgb.DMatrix(data, label=label)

param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}

If we run your code against this data:

KF = KFold(n_splits=2)
xgb.cv(params=param, dtrain=dtrain, folds=KF.split(data))

I get the error:

IndexError                                Traceback (most recent call last)
Cell In[51], line 2
      1 KF = KFold(n_splits=2)
----> 2 xgb.cv(params=param, dtrain=dtrain, folds=KF.split(data))
[..]

IndexError: list index out of range

The documentation asks for a KFold instance, so you just need to pass the KFold object itself:

KF = KFold(n_splits=2)
xgb.cv(params=param, dtrain=dtrain, folds=KF)

You can check the source code and see that it calls the split method itself, so you don't need to pass KF.split(...).
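To illustrate why handing over the splitter object is the more robust choice, here is a rough sketch of how a consumer could accept either a splitter or a ready-made list of folds. Both `ToyKFold` and `resolve_folds` are hypothetical names made up for this example. This is not xgboost's actual implementation, only the general duck-typing pattern.

```python
class ToyKFold:
    """Hypothetical minimal splitter with a KFold-like split() method."""

    def __init__(self, n_splits):
        self.n_splits = n_splits

    def split(self, X):
        n = len(X)
        fold_size = n // self.n_splits
        for k in range(self.n_splits):
            test = list(range(k * fold_size, (k + 1) * fold_size))
            train = [i for i in range(n) if i not in test]
            yield train, test


def resolve_folds(folds, X):
    """Hypothetical helper: return a reusable list of (train, test) pairs."""
    if hasattr(folds, "split"):
        # Splitter object: ask it for a fresh generator and materialise it.
        return list(folds.split(X))
    # Otherwise assume an iterable of (train, test) pairs and copy it once.
    return list(folds)


X = [10, 20, 30, 40]
from_object = resolve_folds(ToyKFold(n_splits=2), X)
from_list = resolve_folds([([2, 3], [0, 1]), ([0, 1], [2, 3])], X)

print(from_object == from_list)
```

Because the consumer calls `split` itself, it always starts from a fresh generator, so the caller cannot accidentally pass in one that is already partly or fully consumed.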

StupidWolf