I would like to utilise multiple GPUs spread across several nodes to train an XGBoost model on a very large dataset within Azure Machine Learning, using 3 NC12s_v3 compute nodes. The dataset exceeds both VRAM and RAM when persisted into Dask, but comfortably fits on disk. However, XGBoost's dask module seems to persist all of the data in memory during training (at least by default).
All data preprocessing has been handled (one-hot encoding with the np.bool data type), and you can assume I have chosen the most efficient data types elsewhere (for example, np.float32 instead of np.float64 for decimal features, int8 for ordinal data, etc.).
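For reference, the dtype narrowing in that preprocessing step looks roughly like this (the column names below are placeholders, not my real features):

import numpy as np
import pandas as pd

# Placeholder frame standing in for the real preprocessed data.
df = pd.DataFrame({
    "price": [1.0, 2.5, 3.25],           # decimal feature
    "rating": [1, 2, 3],                  # ordinal feature
    "colour_red": [True, False, True],    # one-hot encoded flag
    "label": [0, 1, 0],
})
df["price"] = df["price"].astype(np.float32)       # float64 -> float32
df["rating"] = df["rating"].astype(np.int8)        # int64 -> int8
df["colour_red"] = df["colour_red"].astype(bool)   # one-hot columns as booleans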
Currently, I am trying to get a simple toy model working with just a training set. My code is as follows:
from dask_cuda import LocalCUDACluster
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client
# Start cluster and client. This is currently local, although I would like to make this distributed across many nodes.
cluster = LocalCUDACluster(n_workers=2, device_memory_limit='16 GiB')
client = Client(cluster)
# Read in training data.
train_dd = dd.read_parquet("folder_of_training_data/*.parquet", chunksize="128 MiB")
# Split into features and target, all other preprocessing including one hot encoding has been completed.
X = train_dd[train_dd.columns.difference(['label'])]
y = train_dd['label']
# Delete dask dataframe to free up memory.
del train_dd
# Create DaskDMatrix from X, y inputs.
dtrain = xgb.dask.DaskDMatrix(client, X, y)
# Delete X and y to free up memory.
del X
del y
# Create watchlist for input into xgb train method.
watchlist = [(dtrain, 'train')]
# Train toy booster on 10 rounds with subsampling and gradient based sampling to reduce memory requirements.
bst = xgb.dask.train(
    client,
    {
        'predictor': 'gpu_predictor',
        'tree_method': 'gpu_hist',
        'verbosity': 2,
        'objective': 'binary:logistic',
        'sampling_method': 'gradient_based',
        'subsample': 0.1
    },
    dtrain,
    num_boost_round=10,
    evals=watchlist
)
del dtrain
print("History:", str(bst['history']))
With the above on a single node containing 2 GPUs, I can only load up to 32 GB at a time (the limit of the combined VRAM).
From my current code, I have a few questions:
Is there any way I can stop XGBoost from persisting all of the data into memory and instead have it work through partitions in batches?
Is there any way I can get Dask to handle the batching natively, rather than me manually performing, for example, incremental learning?
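For context, the manual incremental-learning loop I would rather not maintain myself would look roughly like this (single process, plain xgboost API, continuing the booster partition by partition via xgb_model):

import glob
import pandas as pd
import xgboost as xgb

params = {'tree_method': 'gpu_hist', 'objective': 'binary:logistic'}
booster = None
for path in sorted(glob.glob("folder_of_training_data/*.parquet")):
    part = pd.read_parquet(path)
    dpart = xgb.DMatrix(part.drop(columns=['label']), label=part['label'])
    # xgb_model=booster continues training from the previous batch's model.
    booster = xgb.train(params, dpart, num_boost_round=10, xgb_model=booster)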
In the docs they mention that external memory mode can be used together with distributed mode. Assuming I had libsvm files, how would I go about this with multiple nodes and multiple GPUs?
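My understanding of external memory mode in a single process is just the cache suffix on the file path, along the lines of the sketch below, but it is not clear to me how (or whether) this carries over to xgboost.dask across several nodes and GPUs:

import xgboost as xgb

# '#dtrain.cache' asks XGBoost to page the libsvm file through an on-disk cache
# instead of holding everything in memory (file name is a placeholder).
dtrain = xgb.DMatrix("train.libsvm#dtrain.cache")
bst = xgb.train(
    {'tree_method': 'gpu_hist', 'objective': 'binary:logistic'},
    dtrain,
    num_boost_round=10
)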
How can I alter my code above such that I can work with more than one node?
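My guess is that the only change needed in the script is pointing the client at a real scheduler rather than a LocalCUDACluster, with a dask-cuda-worker started on each NC12s_v3 node, something like the following (the scheduler address is a placeholder):

# On the scheduler node (shell):
#   dask-scheduler --port 8786
# On each NC12s_v3 node (shell), one worker per GPU is created automatically:
#   dask-cuda-worker tcp://scheduler-host:8786 --device-memory-limit="16 GiB"

from dask.distributed import Client

# Connect to the remote scheduler instead of creating a LocalCUDACluster.
client = Client("tcp://scheduler-host:8786")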
Bonus question: assuming there is a way to batch process with xgboost.dask, how can I integrate this with RAPIDS so the processing happens purely on GPUs?
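For that bonus question, my best guess is swapping dask.dataframe for dask_cudf so the partitions already live in GPU memory when the DaskDMatrix is built, roughly as below (reusing the client created earlier), but I do not know whether this interacts sensibly with any batching:

import dask_cudf
import xgboost as xgb

# Read the parquet files directly into GPU memory as a dask_cudf DataFrame.
train_dd = dask_cudf.read_parquet("folder_of_training_data/*.parquet")
X = train_dd[train_dd.columns.difference(['label'])]
y = train_dd['label']
# client is the distributed client created earlier in the script.
dtrain = xgb.dask.DaskDMatrix(client, X, y)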