DaskML with XGBoost and using eval_set requires pre-computed data

Question

I am trying to run dask_ml.xgboost using eval_set to allow for early stopping in an attempt to avoid overfitting.

Currently, I have a sample dataset, generated as shown in the example below:

from dask.distributed import Client
from dask_ml.datasets import make_classification_df
from dask_ml.xgboost import XGBClassifier


if __name__ == "__main__":
    n_train_rows = 4_000
    n_val_rows = 1_000

    client = Client()
    print(client)

    # Generate balanced data for binary classification
    X_train, y_train = make_classification_df(
        n_samples=n_train_rows,
        chunks=100,
        predictability=0.35,
        n_features=50,
        random_state=2,
    )
    X_val, y_val = make_classification_df(
        n_samples=n_val_rows,
        chunks=100,
        predictability=0.35,
        n_features=50,
        random_state=2,
    )

    clf = XGBClassifier(objective="binary:logistic")

    # train
    clf.fit(
        X_train,
        y_train,
        eval_metric="error",
        eval_set=[
            (X_train.compute(), y_train.compute()),
            (X_val.compute(), y_val.compute()),
        ],
        early_stopping_rounds=5,
    )

    # Make predictions
    y_pred = clf.predict(X_val).compute()
    assert len(y_pred) == len(y_val)

    client.close()

All of X_train, y_train, X_val and y_val are dask DataFrames.

I cannot specify eval_set as a list of tuples of Dask DataFrames, i.e. eval_set=[(X_train, y_train), (X_val, y_val)]. Instead, the entries need to be pandas DataFrames, which is why I call .compute() on each of them.

However, when I run the above code (using pandas DataFrames), I get this warning:

<Client: 'tcp://127.0.0.1:12345' processes=4 threads=12, memory=16.49 GB>
/home/username/.../distributed/worker.py:3373: UserWarning: Large object of size 2.16 MB detected in task graph:
  {'dmatrix_kwargs': {}, 'num_boost_round': 100, 'ev ... ing_rounds': 5}
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  warnings.warn(
task NULL connected to the tracker
task NULL connected to the tracker
task NULL connected to the tracker
task NULL connected to the tracker
task NULL got new rank 0
task NULL got new rank 1
task NULL got new rank 2
task NULL got new rank 3
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.

This code runs through to completion and predictions are generated. However, the clf.fit(...) call produces this UserWarning.

Additional notes

In my use case, the numbers of rows in the training and validation splits in this example reflect the sizes after sampling from the overall data. Unfortunately, the splits actually required for training (and hyperparameter tuning) with dask_ml.xgboost are a few orders of magnitude larger in number of rows. I determined this from training and validation learning curves, per dask_ml recommendations, generated using standard XGBoost (from xgboost import XGBClassifier) rather than dask_ml's version of XGBoost (1, 2). So I can't compute those splits and bring them into memory as pandas DataFrames for distributed XGBoost training.
The number of features used in the example here is 50. In the real use case, I arrived at this number after dropping as many features as possible.
Code is run on a local machine.

Question

Is there a correct/recommended approach to run dask_ml's xgboost with eval_set consisting of Dask DataFrames?

EDIT

Note that the training split is also being passed in eval_set (in addition to the validation split) with the intention of generating learning curves using the output of model training (see here).

After further searching around, it looks like there is an open github issue dealing with this question - see [`dask-xgboost` #63](https://github.com/dask/dask-xgboost/issues/63). The difference being that here a pandas `DataFrame` needs to be passed in to `eval_set`, while a `numpy` array is being computed before passing into `eval_set` there. Looks like standard XGBoost might be warranted here, in order to leverage `eval_set`. — edesz, Nov 02 '20 at 05:04
Note that this github issue ([referenced above](https://stackoverflow.com/questions/64454312/daskml-with-xgboost-and-using-eval-set-requires-pre-computed-data#comment114293338_64454312)) refers to a project that has been [deprecated and its functionality has been absorbed into XGBoost](https://github.com/dask/dask-xgboost#readme). — edesz, Aug 03 '21 at 01:10
