I am trying to use dask_cudf to preprocess a very large dataset (150,000,000+ records) for multi-class XGBoost training and am having trouble encoding the class column (dtype is string).

I tried using the 'replace' function, but the error message said the two dtypes must match. I tried using dask_ml.LabelEncoder, but it said string arrays aren't supported in cudf. I tried using compute() in various ways, but I kept running into out-of-memory errors (I'm assuming because operations on a cudf dataframe require a smaller dataset). I also tried pulling the class column out, encoding it, and then merging it back into the dataframe, but the partitions do not line up. I tried manually lining them up, but dask_cudf seemingly does not support repartitioning using the 'divisions' parameter (I got an error saying something like 'old and new partitions do not match').

Any help on how to do this would be much appreciated.
- Can you share a minimal repro for the partitions not lining up? – TaureanDyerNV Jul 01 '22 at 23:23
- @TaureanDyerNV Unfortunately I cannot directly share my code. – Tejas Sriram Jul 06 '22 at 16:26
- A minimal reproducible example doesn't have to be your actual code, and shouldn't be. It's a concise example of what you're trying to do that shows the failure :) – TaureanDyerNV Jul 07 '22 at 17:29
1 Answer
Strings aren't supported by XGBoost. Not having seen your data, here are a few quick-and-dirty ways I've modified string columns for training, since generally the string content itself may not matter:
- If the strings are actually numeric (like dates), convert them to integers (int8, int16, int32).
- Hashmap the strings to integers (basically creating a reversible conversion between string and integer, as long as you don't change the integers) and train on your current column, now hashed as an integer.
- If the strings are classes, manually assign class numbers (0, 1, 2, ..., n) in a new column and train on that one; see the sketch after this list.
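As a rough illustration of that last option, here is a minimal sketch of how the class labels could be encoded without ever pulling the column out of the dataframe. It assumes the number of distinct classes is small (so computing the unique labels is cheap), that the column is named 'class', and that your cudf version supports Series.map with a dict; the file path and column name are hypothetical placeholders.

```python
import dask_cudf

# Hypothetical input; substitute your own data source
ddf = dask_cudf.read_csv("data.csv")

# The set of distinct class labels is small, so this compute() is cheap
labels = sorted(ddf["class"].unique().compute().to_pandas())
mapping = {label: code for code, label in enumerate(labels)}

def encode_partition(df):
    # Runs independently on each cudf partition, so the full dataset
    # never has to fit on a single GPU at once
    df["class"] = df["class"].map(mapping).astype("int32")
    return df

ddf = ddf.map_partitions(encode_partition)
```

Because the mapping is built once on the client and then applied per partition, the partitions never need to be realigned, which sidesteps the merge/repartition problem entirely.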
There are definitely other, better ways. As for the second part of your question, I left a comment.
Now, your XGBoost model and your dask-cudf dataframe's per-GPU allocation must fit on a single GPU, or you will get memory errors. If your model will be considering a large amount of data, please train on the cluster with the largest GPU memory you can get. A100s come with 40GB and 80GB. Some older compute GPUs, the V100 and GV100, have 32GB. The A6000 and RTX 8000 have 48GB. From there it goes to 24GB, 16GB, and lower. Please size your GPUs accordingly.
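One knob that can help with the out-of-memory errors mentioned in the question (treat this as a suggestion, not part of the answer above): dask-cuda's LocalCUDACluster can spill device memory to host memory before a worker runs out. A minimal sketch, assuming you are starting your own local cluster:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Spill partitions from GPU to host memory once device usage passes
# roughly 80% of each GPU's capacity, instead of failing with an OOM
cluster = LocalCUDACluster(device_memory_limit=0.8)
client = Client(cluster)
```

Spilling is slower than keeping everything on-device, but it can keep a job alive when a few partitions are oversized; check the dask-cuda docs for the exact forms device_memory_limit accepts in your version.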

TaureanDyerNV