
I am using the rapidsai docker container as obtained via

docker pull rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04
docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04

and have started it via

docker run --memory=30g --cpus=12 --gpus all --rm -it \
    -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04-py3.6

When I run the random_forest_mnmg_demo via JupyterLab, I get the following accuracies

SKLearn accuracy:   0.867
CuML accuracy:      0.833

While the notebook says that

Due to randomness in the algorithm, you may see slight variation in accuracies

I would not call this difference "slight".
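
For context, here is a minimal single-GPU sketch of the kind of comparison I mean (the numbers above come from the unmodified random_forest_mnmg_demo notebook; the data set and hyperparameters in this sketch are only illustrative, not the notebook's exact setup):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier as skRF
from cuml.ensemble import RandomForestClassifier as cumlRF

# illustrative data set (not the notebook's exact data)
X, y = make_classification(n_samples=100000, n_features=20,
                           n_informative=10, random_state=0)

# cuML RF expects float32 features and int32 labels
X = X.astype(np.float32)
y = y.astype(np.int32)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# hyperparameters I try to keep identical between the two libraries
common = dict(n_estimators=100, max_depth=16, max_features=1.0)

sk_model = skRF(random_state=0, n_jobs=-1, **common).fit(X_train, y_train)
cu_model = cumlRF(seed=0, n_bins=16, **common).fit(X_train, y_train)

print("SKLearn accuracy:", sk_model.score(X_test, y_test))
print("CuML accuracy:   ", cu_model.score(X_test, y_test))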

As a side note: I have also tested and modified the other RF notebook (random_forest_demo) and observed accuracy differences as large as 0.95 vs 0.75 (for different data set sizes and RF parameters). According to the cuML documentation, the cuML node-split algorithm differs from sklearn's. I have therefore set `split_algo=0` and tried various `n_bins` values, without success. I have also tested h2o's RF implementation on random_forest_demo, and h2o and sklearn give very similar results most of the time.
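
For clarity, this is roughly the kind of sweep I ran (again only a sketch, reusing the train/test split from the snippet above; as I understand the cuML 0.14 docs, `split_algo=0` selects the histogram splitter and `split_algo=1` the global-quantile one):

from cuml.ensemble import RandomForestClassifier as cumlRF

# sweep n_bins with the histogram split algorithm (split_algo=0)
for n_bins in (8, 16, 32, 64, 128, 200):
    model = cumlRF(n_estimators=100, max_depth=16, max_features=1.0,
                   split_algo=0, n_bins=n_bins, seed=0)
    model.fit(X_train, y_train)
    print("n_bins=%3d  accuracy=%.3f" % (n_bins, model.score(X_test, y_test)))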

There is a similar question on SO, but it seems that this issue was related to cuML version 0.12 and should have been fixed in version 0.14, which I am using. So there must be something else going on.

I have compared the sklearn and cuML parameter settings for RF and I think they should be close enough to produce similar results. Did I miss some configuration settings? Or might this be hardware related?

nvidia-smi output (executed on host machine, GPU is "GeForce GTX 1050 Ti with Max-Q Design")

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   64C    P0    N/A /  N/A |   1902MiB /  4042MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

CUDA version as given by `nvcc --version`

Cuda compilation tools, release 10.0, V10.0.130
cryo111
  • Run the random_forest_mnmg_demo a couple of times and collect a few accuracies. It's hard / impossible to tell how far apart 2 distributions are (in this case the distributions refer to the accuracy) with 1 value. – Pani Jun 23 '20 at 09:38
  • 2
  • @Pani I have trained the `sklearn` and `cuML` models now with 3 different seeds and got the following accuracies `(0.865, 0.84), (0.865, 0.838), (0.865, 0.833)`. So there is little variation in the performance numbers (more for `cuML`, though). Some time ago, I did a more thorough analysis of how much `h2o` RF models change with different seeds (based on 100 samples, if I remember correctly). The variance of the performance numbers was also pretty low. My guess is that RF algos usually converge to very similar models (given "sensible" hyperparameters and the same training data). – cryo111 Jun 23 '20 at 10:33
  • IIUC, the difference in splitting algorithm would be most reduced by setting `n_bins` very very large? – Ben Reiniger Jun 23 '20 at 13:28
  • 1
  • @BenReiniger I have set `n_bins=200`. At that point, `cuML` takes longer than `sklearn` for model training. The accuracy gap still persists `(0.88, 0.85)` - note that these numbers are based on new training data. BTW: I have tried playing around with `n_bins` before. `h2o` also uses histograms and allows setting this parameter. Setting it to values similar to those used for `cuML` still gives much better results - comparable to `sklearn`. – cryo111 Jun 23 '20 at 14:10
  • Same results on `rapidsai/rapidsai-nightly:cuda10.0-runtime-ubuntu18.04` using `cuML` v 0.15a (nightly). – cryo111 Jun 23 '20 at 14:11
  • 2
    Issue #2518 in cuML's github repo (https://github.com/rapidsai/cuml/issues/2518) is tracking this issue. – saloni Aug 11 '20 at 18:59

0 Answers