I would like to utilise multiple GPUs spread across several nodes to train an XGBoost model on a very large dataset within Azure Machine Learning, using 3 NC12s_v3 compute nodes. The dataset exceeds both VRAM and RAM when persisted into Dask, but comfortably fits on disk. However, XGBoost's dask module seems to persist all of the data in memory during training (at least by default).
All data preprocessing has been handled (one-hot encoding with the np.bool data type), and you can assume I have chosen the most efficient data types elsewhere (for example, np.float32 instead of np.float64 for decimal features, int8 for ordinal data, etc.).
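For reference, the dtype narrowing in that preprocessing step looks roughly like this (the column names below are placeholders, not my real features):

import numpy as np
import pandas as pd

# Placeholder frame standing in for the real preprocessed data.
df = pd.DataFrame({
    "price": [1.0, 2.5, 3.25],           # decimal feature
    "rating": [1, 2, 3],                  # ordinal feature
    "colour_red": [True, False, True],    # one-hot encoded flag
    "label": [0, 1, 0],
})
df["price"] = df["price"].astype(np.float32)       # float64 -> float32
df["rating"] = df["rating"].astype(np.int8)        # int64 -> int8
df["colour_red"] = df["colour_red"].astype(bool)   # one-hot columns as booleans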
Currently, I am trying to get a simple toy model working with just a training set. My code is as follows:
from dask_cuda import LocalCUDACluster
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client
# Start cluster and client. This is currently local, although I would like to make this distributed across many nodes.
cluster = LocalCUDACluster(n_workers=2, device_memory_limit='16 GiB')
client = Client(cluster)
# Read in training data.
train_dd = dd.read_parquet("folder_of_training_data/*.parquet", chunksize="128 MiB")
# Split into features and target, all other preprocessing including one hot encoding has been completed.
X = train_dd[train_dd.columns.difference(['label'])]
y = train_dd['label']
# Delete dask dataframe to free up memory.
del train_dd
# Create DaskDMatrix from X, y inputs.
dtrain = xgb.dask.DaskDMatrix(client, X, y)
# Delete X and y to free up memory.
del X
del y
# Create watchlist for input into xgb train method.
watchlist = [(dtrain, 'train')]
# Train toy booster on 10 rounds with subsampling and gradient based sampling to reduce memory requirements.
bst = xgb.dask.train(
    client,
    {
        'predictor': 'gpu_predictor',
        'tree_method': 'gpu_hist',
        'verbosity': 2,
        'objective': 'binary:logistic',
        'sampling_method': 'gradient_based',
        'subsample': 0.1
    },
    dtrain,
    num_boost_round=10,
    evals=watchlist
)
del dtrain
print("History:", str(bst['history']))
With the above on a single node containing 2 GPUs, I can only load up to 32 GB at a time (the limit of the combined VRAM).
From my current code, I have a few questions:
Is there any way I can stop XGBoost from persisting all of the data into memory and instead have it work through partitions in batches?
Is there any way I can get Dask to handle the batching natively, rather than me manually performing, for example, incremental learning?
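For context, the manual incremental-learning loop I would rather not maintain myself would look roughly like this (single process, plain xgboost API, continuing the booster partition by partition via xgb_model):

import glob
import pandas as pd
import xgboost as xgb

params = {'tree_method': 'gpu_hist', 'objective': 'binary:logistic'}
booster = None
for path in sorted(glob.glob("folder_of_training_data/*.parquet")):
    part = pd.read_parquet(path)
    dpart = xgb.DMatrix(part.drop(columns=['label']), label=part['label'])
    # xgb_model=booster continues training from the previous batch's model.
    booster = xgb.train(params, dpart, num_boost_round=10, xgb_model=booster)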
In the docs they mention that external memory mode can be used together with distributed mode. Assuming I had libsvm files, how would I go about this with multiple nodes and multiple GPUs?
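My understanding of external memory mode in a single process is just the cache suffix on the file path, along the lines of the sketch below, but it is not clear to me how (or whether) this carries over to xgboost.dask across several nodes and GPUs:

import xgboost as xgb

# '#dtrain.cache' asks XGBoost to page the libsvm file through an on-disk cache
# instead of holding everything in memory (file name is a placeholder).
dtrain = xgb.DMatrix("train.libsvm#dtrain.cache")
bst = xgb.train(
    {'tree_method': 'gpu_hist', 'objective': 'binary:logistic'},
    dtrain,
    num_boost_round=10
)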
How can I alter my code above such that I can work with more than one node?
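My guess is that the only change needed in the script is pointing the client at a real scheduler rather than a LocalCUDACluster, with a dask-cuda-worker started on each NC12s_v3 node, something like the following (the scheduler address is a placeholder):

# On the scheduler node (shell):
#   dask-scheduler --port 8786
# On each NC12s_v3 node (shell), one worker per GPU is created automatically:
#   dask-cuda-worker tcp://scheduler-host:8786 --device-memory-limit="16 GiB"

from dask.distributed import Client

# Connect to the remote scheduler instead of creating a LocalCUDACluster.
client = Client("tcp://scheduler-host:8786")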
Bonus question: assuming there is a way to batch process with xgboost.dask, how can I integrate this with RAPIDS so the processing happens purely on GPUs?
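For that bonus question, my best guess is swapping dask.dataframe for dask_cudf so the partitions already live in GPU memory when the DaskDMatrix is built, roughly as below (reusing the client created earlier), but I do not know whether this interacts sensibly with any batching:

import dask_cudf
import xgboost as xgb

# Read the parquet files directly into GPU memory as a dask_cudf DataFrame.
train_dd = dask_cudf.read_parquet("folder_of_training_data/*.parquet")
X = train_dd[train_dd.columns.difference(['label'])]
y = train_dd['label']
# client is the distributed client created earlier in the script.
dtrain = xgb.dask.DaskDMatrix(client, X, y)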