CUML: Random Forest Model Can't Be Trained on a Multi GPU Dask Cluster

Question

Following the official distributed model training example (https://github.com/rapidsai/cuml/blob/branch-0.18/notebooks/random_forest_mnmg_demo.ipynb), I used the Iris dataset to train a random forest model on a multi-GPU Dask cluster (one scheduler node, three worker nodes), but the model does not train properly. The output is as follows:

CuML accuracy:   0.36666666666666664
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "cuml/ensemble/randomforestclassifier.pyx", line 334, in cuml.ensemble.randomforestclassifier.RandomForestClassifier.__del__
  File "cuml/ensemble/randomforestclassifier.pyx", line 350, in cuml.ensemble.randomforestclassifier.RandomForestClassifier._reset_forest_data
AttributeError: 'NoneType' object has no attribute 'free_treelite_model'

Process finished with exit code 0

I created my environment with this conda command:

conda create -n rapids-0.18 -c rapidsai -c nvidia -c conda-forge \
    -c defaults rapids-blazing=0.18 python=3.8 cudatoolkit=10.2

The code I use for the RAPIDS RandomForestClassifier is:

import pandas as pd
import cudf
import cuml
from cuml import train_test_split
from cuml.metrics import accuracy_score
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask_cudf
from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF

# Connect to the running Dask scheduler
c = Client('node0:8786')

# Query the client for all connected workers
workers = c.has_what().keys()
n_workers = len(workers)
n_streams = 8  # Performance optimization

# Random Forest building parameters
max_depth = 12
n_bins = 16
n_trees = 1000

# Read data
pdf = pd.read_csv('/data/iris.csv',header = 0, delimiter = ',') # Get complete CSV
cdf = cudf.from_pandas(pdf) # Get cuda dataframe
features = cdf.iloc[:, [0, 1, 2, 3]].astype('float32') # Get data columns
labels = cdf.iloc[:, 4].astype('category').cat.codes.astype('int32') # Get label column

# Split train and test data
X_train, X_test, y_train, y_test = train_test_split(feature, label, train_size=0.8, shuffle=True)

# Distribute data to worker GPUs
n_partitions = n_workers
X_train_dask = dask_cudf.from_cudf(X_train, npartitions=n_partitions)
X_test_dask = dask_cudf.from_cudf(X_test, npartitions=n_partitions)
y_train_dask = dask_cudf.from_cudf(y_train, npartitions=n_partitions)

# Train the distributed cuML model
cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins, n_streams=n_streams)
cuml_model.fit(X_train_dask, y_train_dask)

wait(cuml_model.rfs)  # Allow asynchronous training tasks to finish

# Predict and check accuracy
cuml_y_pred = cuml_model.predict(X_test_dask).compute().to_array()
print("CuML accuracy:  ", accuracy_score(y_test.to_array(), cuml_y_pred))

The results are the same when I use LocalCUDACluster instead.

Can you point out my mistake and give me the correct code? And if I want to evaluate decision trees on the trained random forest model, how can I get those trained decision trees?

Thank you.

The code provided above has a few typos in the variable names here: `train_test_split(feature, label, train_size=0.8, shuffle=True)`. Ignoring the typos, the error above suggests that the code runs, prints the cuML accuracy value, and only fails after the provided code finishes. Do you get this error when your code exits, or is there some code missing from the above example? — saloni, Apr 20 '21 at 21:31
@saloni thank you for your comments. It's just a typo and it doesn't appear in the code I actually run. The example is complete, and I get this error when the code exits. The cuML accuracy is much lower than that of the code example (https://stackoverflow.com/questions/60651169/why-randomforestclassifier-on-cpu-using-sklearn-and-on-gpu-using-rapids-get). — nomad, Apr 21 '21 at 02:15
The iris dataset is very small: it has just 150 samples. I would recommend you use the non-Dask implementation of Random Forest for small datasets. — saloni, Apr 21 '21 at 14:52
Please note that if you use Dask RF for small datasets, the accuracy will be lower than what you would get with the single-GPU RF implementation (https://docs.rapids.ai/api/cuml/stable/api.html?highlight=random%20forest#random-forest). Unfortunately, I am unable to reproduce the above error. If you would like to use Dask RF, then I would recommend you use rapids/cuml version 0.19, set the `broadcast_data` parameter in the `fit` function, and see if you still get the above error. — saloni, Apr 21 '21 at 15:02
@saloni thank you for your suggestion. I used the `broadcast_data` parameter in the `fit` function under `LocalCUDACluster`, and the error disappeared. But the cuML accuracy is still lower than 0.5, whereas using `predict_model = 'CPU'` gives an accuracy of 1.0. I think the problem is similar to this question: [link](https://stackoverflow.com/questions/60651169/why-randomforestclassifier-on-cpu-using-sklearn-and-on-gpu-using-rapids-get?noredirect=1&lq=1). — nomad, Apr 22 '21 at 08:05
There is some difference in the accuracy obtained with sklearn's and cuML's RF models. There is an issue open in cuML's GitHub repo (https://github.com/rapidsai/cuml/issues/3764) and the team is currently working on it. I would also like to reiterate that for best results, please use Dask models with large datasets and the non-Dask implementation of RF (https://docs.rapids.ai/api/cuml/nightly/api.html#random-forest) with smaller datasets. — saloni, Apr 22 '21 at 16:35
Hi, would you mind setting `ignore_empty_partitions` to `True` and `broadcast_data` to `False` and testing your code to see if you can reproduce the above error with Dask RF? Thanks! — saloni, Apr 22 '21 at 19:29
@saloni I tested my code with different combinations of `ignore_empty_partitions` and `broadcast_data`. The results show that the error with Dask RF is related to `broadcast_data` and not to `ignore_empty_partitions`. I did not convert `cuml_y_pred` to an array. If I set `broadcast_data` to `True`, the error disappears and the type of `cuml_y_pred` is `cupy.core.core.ndarray`. If not, the error appears and the type of `cuml_y_pred` is `cudf.core.series.Series`. — nomad, Apr 23 '21 at 02:21
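
To make the small-dataset point in the comments concrete, here is a minimal sketch. It assumes, as I read the cuML documentation, that the Dask random forest has each worker build roughly `n_estimators / n_workers` trees on the rows it holds locally, and that `broadcast_data=True` in `fit` gives every worker the full training set. The helper names `per_worker_share` and `fit_dask_rf` are hypothetical, and `fit_dask_rf` is an untested outline that needs a GPU, a running Dask client, and cuML 0.19 or later.

```python
import math


def per_worker_share(n_rows, n_trees, n_workers, broadcast_data=False):
    """Estimate how many trees and training rows each Dask worker handles.

    Assumption (not verified against the cuML source): each worker builds
    about n_trees / n_workers trees and, unless the data is broadcast,
    only sees its own partition of the rows.
    """
    trees = math.ceil(n_trees / n_workers)
    rows = n_rows if broadcast_data else math.ceil(n_rows / n_workers)
    return trees, rows


def fit_dask_rf(X_train_dask, y_train_dask,
                broadcast_data=True, ignore_empty_partitions=False):
    """Untested outline: fit the Dask RF with the options from the comments."""
    # Imported lazily so per_worker_share works without RAPIDS installed.
    from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF

    model = cumlDaskRF(n_estimators=1000, max_depth=12, n_bins=16,
                       n_streams=8,
                       ignore_empty_partitions=ignore_empty_partitions)
    model.fit(X_train_dask, y_train_dask, broadcast_data=broadcast_data)
    return model


if __name__ == "__main__":
    # Iris: 150 rows, train_size=0.8 gives 120 training rows on 3 workers.
    print(per_worker_share(120, 1000, 3))
    print(per_worker_share(120, 1000, 3, broadcast_data=True))
```

Under these assumptions, the setup in the question leaves each of the three workers with about 40 training rows, or all 120 when the data is broadcast. That is consistent with the advice above to prefer the single-GPU implementation for a dataset of this size.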
