
I'm unable to match LGBM's cv score by hand.

Here's an MCVE:

from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import numpy as np

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

folds = KFold(5, random_state=42)

params = {'random_state': 42}

results = lgb.cv(params, lgb.Dataset(X_train, y_train), folds=folds, num_boost_round=1000, early_stopping_rounds=100, metrics=['auc'])
print('LGBM\'s cv score: ', results['auc-mean'][-1])

clf = lgb.LGBMClassifier(**params, n_estimators=len(results['auc-mean']))

val_scores = []
for train_idx, val_idx in folds.split(X_train):
    clf.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    val_scores.append(roc_auc_score(y_train.iloc[val_idx], clf.predict_proba(X_train.iloc[val_idx])[:,1]))
print('Manual score: ', np.mean(np.array(val_scores)))

I was expecting the two CV scores to be identical - I have set random seeds, and done exactly the same thing. Yet they differ.

Here's the output I get:

LGBM's cv score:  0.9851513530737058
Manual score:  0.9903622177441328

Why? Am I not using LGBM's cv module correctly?

EuRBamarth
  • Could it be that your manual classifier doesn't use the same `num_boost_round` and `early_stopping_rounds` parameters? As far as I can see, you don't pass them explicitly into the `__init__` method or when calling `fit`. – devforfu Feb 15 '19 at 12:56
  • `num_boost_round` is the same as `n_estimators`. And if I'm setting the number of estimators explicitly, there's no need for early stopping – EuRBamarth Feb 15 '19 at 13:07
  • Ah, ok, but if we have early stopping enabled in one case, and not enabled in another one, could it be the reason of the difference? – devforfu Feb 15 '19 at 13:35
  • In the `cv` case, you set a maximum number of iterations, and stop once your cv score hasn't improved for more than `early_stopping_rounds` iterations. In the other case, I directly set that number as the number of iterations to go through – EuRBamarth Feb 15 '19 at 13:53

1 Answer


You are splitting X into X_train and X_test. For cv you split X_train into 5 folds, while manually you split X into 5 folds, i.e. you use more data points manually than with cv.

Change `results = lgb.cv(params, lgb.Dataset(X_train, y_train), ...)` to `results = lgb.cv(params, lgb.Dataset(X, y), ...)`.

Furthermore, other parameters can differ. For example, the number of threads used by LightGBM changes the result. During cv the fold models are fitted in parallel, so the number of threads used might differ from your manual sequential training.
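
For instance, a minimal sketch (the single-thread setting is an assumption for illustration, not something the question sets) of pinning those parameters in one dict and reusing it for both runs:

# sketch: one params dict shared by lgb.cv and the manual loop,
# with the seed and thread count fixed so both runs are configured identically
params = {
    'objective': 'binary',
    'metric': 'auc',
    'seed': 42,         # same seed for both runs
    'num_threads': 1,   # single-threaded training, so results are reproducible run to run
}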

EDIT after 1st correction:

You can achieve the same results with manual splitting and cv using this code:

from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import numpy as np

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

folds = KFold(5, random_state=42)


params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective':'binary',
        'metric':'auc',
        }

data_all = lgb.Dataset(X_train, y_train)

results = lgb.cv(params, data_all, 
                 folds=folds.split(X_train), 
                 num_boost_round=1000, 
                 early_stopping_rounds=100)

print('LGBM\'s cv score: ', results['auc-mean'][-1])

val_scores = []
for train_idx, val_idx in folds.split(X_train):
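    # build the fold's training Dataset; reference=data_all makes it reuse
    # the bin boundaries computed on the full X_train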

    data_trd = lgb.Dataset(X_train.iloc[train_idx], 
                           y_train.iloc[train_idx], 
                           reference=data_all)

    gbm = lgb.train(params,
                    data_trd,
                    num_boost_round=len(results['auc-mean']),
                    verbose_eval=100)

    val_scores.append(roc_auc_score(y_train.iloc[val_idx], gbm.predict(X_train.iloc[val_idx])))
print('Manual score: ', np.mean(np.array(val_scores)))

yields

LGBM's cv score:  0.9914524426410262
Manual score:  0.9914524426410262

What makes the difference is this line: reference=data_all. During cv, the binning of the variables (see the LightGBM docs) is constructed using the whole dataset (X_train), while in your manual for loop it was built on the training subset (X_train.iloc[train_idx]). By passing the reference to the Dataset containing all the data, LightGBM will reuse the same binning, giving the same results.

Florian Mutel
  • That's a good spot - I have fixed that error. However, the results still don't match. So once I have a result from `lightgbm.cv`, how am I supposed to use it, if `lightgbm.LGBMClassifier` doesn't behave in exactly the same way? Do I just accept their differences? – EuRBamarth Feb 15 '19 at 13:41
  • The problem was the binning process. I am editing my post to give you a reproducible example – Florian Mutel Feb 15 '19 at 13:54
  • Aw, this is fantastic! Thanks Florian, I really appreciate it – EuRBamarth Feb 15 '19 at 14:08
  • And you asked about the way to use it: a good one is to retrain a model using some more trees (like len(results['auc-mean']) * 1.1) on the whole dataset (data_all) without validation, as sketched below. You should expect a performance improvement on your X_test/y_test split by doing so. (You add more trees because you use more data.) – Florian Mutel Feb 15 '19 at 14:11
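
For reference, a rough sketch of that retraining idea, reusing the names from the answer's code (the 1.1 factor is just the heuristic from the comment, not a LightGBM recommendation):

# retrain on all of X_train with ~10% more trees than cv selected,
# then score the untouched test split
best_rounds = len(results['auc-mean'])
final_model = lgb.train(params, data_all, num_boost_round=int(best_rounds * 1.1))

test_pred = final_model.predict(X_test)   # predicted probabilities for the positive class
print('Test AUC: ', roc_auc_score(y_test, test_pred))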