I am using Dask and RAPIDS to run an XGBoost model on a large (6.9 GB) dataset. The hardware is 4x RTX 2080 Tis with 11 GB of memory each. The raw dataset has a few dozen target columns that have been one-hot encoded, so I am trying to run a loop that keeps one target column at a time, drops the rest, trains the model, and repeats.
Here's some pseudocode to show the process:
with LocalCUDACluster(n_workers=4) as cluster:
    with Client(cluster) as client:
        raw_data = dask_cudf.read_csv(MyFile)
        # drop all target columns but one, then create DaskDMatrix
        # objects for training and testing
        d_train, d_test = ...
        # custom grid-search function that loops over xgb.dask.train
        # and returns a dict of optimal params
        model = ...
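Fleshed out, a single-target run looks roughly like this (drop_other_targets is a hypothetical stand-in for my actual helper, and I've omitted the test split and grid search for brevity):

import xgboost as xgb
import dask_cudf
from dask_cuda import LocalCUDACluster
from distributed import Client

TARGETS = ['target1', 'target2']  # the one-hot encoded target columns

def drop_other_targets(df, target):
    # hypothetical helper: keep one target column, drop the rest
    return df.drop(columns=[c for c in TARGETS if c != target])

with LocalCUDACluster(n_workers=4) as cluster:
    with Client(cluster) as client:
        raw_data = dask_cudf.read_csv('my_file.csv')
        df = drop_other_targets(raw_data, 'target1')
        X = df.drop(columns=['target1'])
        y = df['target1']
        d_train = xgb.dask.DaskDMatrix(client, X, y)
        result = xgb.dask.train(
            client,
            {'tree_method': 'gpu_hist', 'objective': 'binary:logistic'},
            d_train,
            num_boost_round=100,
        )
        booster = result['booster']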
OK, so if I run that on any individual target column, it works fine: it pushes up against available memory but doesn't break. If I try to do it in a loop inside the cluster/client context:
targets = ['target1', 'target2', ...]
with LocalCUDACluster(n_workers=4) as cluster:
    with Client(cluster) as client:
        for target in targets:
            # same code as above
it breaks on the third pass through the loop with an out-of-memory error. If, instead, I create the cluster and client inside the loop:
targets = ['target1', 'target2', ...]
for target in targets:
    with LocalCUDACluster(n_workers=4) as cluster:
        with Client(cluster) as client:
            # same code as above
it works fine.
So I'm guessing that some of the objects created during a loop iteration are persisting. I tried setting d_train and d_test to empty lists at the end of each iteration to clear out the memory, but that didn't work. It also doesn't matter whether I read in the raw data before the loop or inside it, or any of the other changes I've tried.
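For concreteness, that cleanup attempt was essentially this (make_matrices and run_gridsearch stand in for my actual helpers):

for target in targets:
    d_train, d_test = make_matrices(raw_data, target)    # hypothetical helper
    params = run_gridsearch(client, d_train, d_test)     # hypothetical helper
    # rebind the names, hoping the DaskDMatrix objects get
    # garbage-collected and the GPU memory is released -- it isn't
    d_train, d_test = [], []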
The only thing that seems to work is reinitializing the cluster and client on every iteration.
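A lighter-weight variant of that workaround might be to keep the cluster alive and just restart the workers between iterations, though I haven't verified that this releases GPU memory the way a full teardown does:

with LocalCUDACluster(n_workers=4) as cluster:
    with Client(cluster) as client:
        for target in targets:
            # ... same training code as above ...
            client.restart()  # restarts every worker process, dropping all state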
Why is that? Is there a more efficient way of handling memory in a loop like this over Dask-backed GPU data? I've read that Dask is inefficient in loops anyway, so maybe I shouldn't even be structuring the work this way.
Any insight into memory management with Dask or RAPIDS would be greatly appreciated.