Why do RandomForestClassifier on CPU (using SKLearn) and on GPU (using RAPIDS) get very different scores?

Question

I am using RandomForestClassifier on the CPU with SKLearn and on the GPU with RAPIDS. I am benchmarking the two libraries for speed-up and scoring on the Iris dataset (this is a first try; I am just starting with these two libraries and will switch to a better benchmarking dataset later).

The problem is that the score on the CPU is always 1.0, but the score on the GPU varies between 0.2 and 1.0, and I do not understand why this happens.

First of all, the library versions I am using are:

NumPy Version: 1.17.5
Pandas Version: 0.25.3
Scikit-Learn Version: 0.22.1
cuPY Version: 6.7.0
cuDF Version: 0.12.0
cuML Version: 0.12.0
Dask Version: 2.10.1
DaskCuda Version: 0+unknown
DaskCuDF Version: 0.12.0
MatPlotLib Version: 3.1.3
SeaBorn Version: 0.10.0

The code I use for SKLearn RandomForestClassifier is:

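# Imports implied by the aliases used below
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as skRandomForestClassifier
from sklearn.metrics import accuracy_score as sk_accuracy_score
from sklearn.model_selection import train_test_split as sk_train_test_split

# Read data in host memory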
host_s_csv = pd.read_csv('./DataSet/iris.csv', header = 0, delimiter = ',') # Get complete CSV
host_s_data = host_s_csv.iloc[:, [0, 1, 2, 3]].astype('float32') # Get data columns
host_s_labels = host_s_csv.iloc[:, 4].astype('category').cat.codes # Get labels column

# Plot data
#sns.pairplot(host_s_csv, hue = 'variety');

# Split train and test data
host_s_data_train, host_s_data_test, host_s_labels_train, host_s_labels_test = sk_train_test_split(host_s_data, host_s_labels, test_size = 0.2, random_state = 0)

# Create RandomForest model
sk_s_random_forest = skRandomForestClassifier(n_estimators = 40,
                                             max_depth = 16,
                                             max_features = 1.0,
                                             random_state = 10, 
                                             n_jobs = 1)

# Fit data in RandomForest
sk_s_random_forest.fit(host_s_data_train, host_s_labels_train)

# Predict data
sk_s_random_forest_labels_predicted = sk_s_random_forest.predict(host_s_data_test)

# Check score
print('accuracy_score: ', sk_accuracy_score(host_s_labels_test, sk_s_random_forest_labels_predicted))

The code I use for RAPIDs RandomForestClassifier is:

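# Imports implied by the aliases used below (module paths as in RAPIDS 0.12)
import cudf
from cuml.ensemble import RandomForestClassifier as cusRandomForestClassifier
from cuml.metrics import accuracy_score as cu_accuracy_score
from cuml.preprocessing.model_selection import train_test_split as cu_train_test_split

# Read data in device memory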
device_s_csv = cudf.read_csv('./DataSet/iris.csv', header = 0, delimiter = ',') # Get complete CSV
device_s_data = device_s_csv.iloc[:, [0, 1, 2, 3]].astype('float32') # Get data columns
device_s_labels = device_s_csv.iloc[:, 4].astype('category').cat.codes # Get labels column

# Plot data
#sns.pairplot(device_s_csv.to_pandas(), hue = 'variety');

# Split train and test data
device_s_data_train, device_s_data_test, device_s_labels_train, device_s_labels_test = cu_train_test_split(device_s_data, device_s_labels, train_size = 0.8, shuffle = True, random_state = 0)

# Use same data as host
#device_s_data_train = cudf.DataFrame.from_pandas(host_s_data_train)
#device_s_data_test = cudf.DataFrame.from_pandas(host_s_data_test)
#device_s_labels_train = cudf.Series.from_pandas(host_s_labels_train).astype('int32')
#device_s_labels_test = cudf.Series.from_pandas(host_s_labels_test).astype('int32')

# Create RandomForest model
cu_s_random_forest = cusRandomForestClassifier(n_estimators = 40,
                                               max_depth = 16,
                                               max_features = 1.0,
                                               n_streams = 1)

# Fit data in RandomForest
cu_s_random_forest.fit(device_s_data_train, device_s_labels_train)

# Predict data
cu_s_random_forest_labels_predicted = cu_s_random_forest.predict(device_s_data_test)

# Check score
print('accuracy_score: ', cu_accuracy_score(device_s_labels_test, cu_s_random_forest_labels_predicted))

And an example of the iris dataset I am using is:

Do you know why this could be happening? Both models are configured the same way, with the same parameters. I have no idea why there is such a big difference between the scores.

Thank you.

I don't know the RAPIDS library, but a computation done on the GPU usually requires a data formatting step first. So I would say the cause is either in the formatting step or in the computation itself. Do you happen to know which part of the algorithm is computed on the GPU? (A big difference in the results of this algorithm usually means a difference in the way splitting rules are computed.) — Bruce Swain, Mar 12 '20 at 12:27

score 3 · Accepted Answer · answered Mar 12 '20 at 17:30

This is caused by a known issue in our predict code, which was corrected in 0.13 with a warning and a fallback to CPU for multi-class classification. In version 0.12 we had neither the warning nor the fallback, so if you didn't know to use predict_model="CPU" on a multi-class classification, you'd get a much lower prediction score than you should with the model you just fit.

See issue here: https://github.com/rapidsai/cuml/issues/1623
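
The version dependence described above can be captured in a small guard. This is only a sketch and not part of cuML: `needs_explicit_cpu_predict` is a hypothetical helper, and it assumes a version string of the form `0.12.0`, like the one listed in the question. It encodes only what is stated in this answer: before 0.13 there is no warning or fallback, so multi-class predictions need `predict_model="CPU"` passed explicitly.

```python
def needs_explicit_cpu_predict(cuml_version: str, n_classes: int) -> bool:
    """Return True when predict_model="CPU" should be passed explicitly.

    Hypothetical helper: encodes only the behaviour described in this answer
    (no warning or CPU fallback for multi-class prediction before 0.13).
    """
    # Assumes a "major.minor[.patch...]" version string, e.g. "0.12.0".
    major, minor = (int(part) for part in cuml_version.split(".")[:2])
    is_multiclass = n_classes > 2
    return is_multiclass and (major, minor) < (0, 13)


# Iris has three classes, so 0.12 needs the explicit CPU path.
print(needs_explicit_cpu_predict("0.12.0", n_classes=3))  # True
print(needs_explicit_cpu_predict("0.13.0", n_classes=3))  # False
```

With a real install, one would presumably pass `cuml.__version__` and the number of distinct labels. A version string that does not start with two numeric components (for example `0+unknown`) would make this sketch raise an error.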

Here's some code to help you and others. It's been modified to be a bit easier for others to reuse in the future. I get ~ 0.9333 on a GV100 and RAPIDS 0.12 stable.

import cudf as cu
from cuml.ensemble import RandomForestClassifier as cusRandomForestClassifier
from cuml.metrics import accuracy_score as cu_accuracy_score
from cuml.preprocessing.model_selection import train_test_split as cu_train_test_split
import numpy as np

# data link: https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv

# Read data
df = cu.read_csv('./iris.csv', header = 0, delimiter = ',') # Get complete CSV

# Prep data
X = df.iloc[:, [0, 1, 2, 3]].astype(np.float32) # Get data columns.  Must be float32 for our Classifier
y = df.iloc[:, 4].astype('category').cat.codes # Get labels column.  Will convert to int32

cu_s_random_forest = cusRandomForestClassifier(
                                           n_bins = 16, 
                                           n_estimators = 40,
                                           max_depth = 16,
                                           max_features = 1.0,
                                           n_streams = 1)

train_data, test_data, train_label, test_label = cu_train_test_split(X, y, train_size=0.8)

# Fit data in RandomForest
cu_s_random_forest.fit(train_data,train_label)

# Predict data
predict = cu_s_random_forest.predict(test_data, predict_model="CPU") # use CPU to do multi-class classifications
print(predict)

# Check score
print('accuracy_score: ', cu_accuracy_score(test_label, predict))
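
A note on reading the score: the standard Iris dataset has 150 rows, so `train_size=0.8` leaves about 30 test rows, and accuracy can only move in steps of 1/30. The sketch below works through that arithmetic. The function names are made up for illustration, and the exact 120/30 split is an assumption about how the library rounds.

```python
N_ROWS = 150  # size of the standard Iris dataset


def n_test_rows(n_rows, train_percent=80):
    # Integer arithmetic avoids float rounding; the exact rounding used by
    # the library's split is an assumption here.
    return n_rows - n_rows * train_percent // 100


def accuracy(n_test, n_wrong):
    # Fraction of test rows that were classified correctly.
    return (n_test - n_wrong) / n_test


n_test = n_test_rows(N_ROWS)
for n_wrong in range(4):
    print(n_wrong, round(accuracy(n_test, n_wrong), 4))
```

Under these assumptions, a score of about 0.9333 corresponds to 2 misclassified rows out of 30, so small differences between runs or between libraries on this dataset amount to one or two test samples.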

Is it normal that GPU based models from cuml, xgboost, lightgbm consistently score several percentage points lower than their CPU siblings? — Sergey Bushmanov, Mar 12 '20 at 22:36
Thank you for your explanation, @TaureanDyerNV. Now I understand where the problem was. Are you planning to include a GPU multi-class predictive model in the near future? — JuMoGar, Mar 13 '20 at 12:07
@SergeyBushmanov we sorta go back and forth. I've had accuracy scores where we're far better than CPU, some where we are worse, but not by much (unless something is wrong). Lots of variables for non deterministic training, like random forest. However, if we're scoring significantly worse, we'd love to know and see what is wrong to improve it. Can you share some examples on our slack channel? — TaureanDyerNV, Mar 20 '20 at 19:08
@JuMoGar we are constantly improving our code. looking at https://github.com/rapidsai/cuml/pull/1757, seems we were trying to get it in for 0.13, but pushed it to 0.14. Please join our community github, slack, and twitter, so you can communicate with us and we can keep you better updated https://rapids.ai/community.html#rapids-community — TaureanDyerNV, Mar 20 '20 at 19:14

Vishal Mehta · Answer 2 · 2020-03-12T17:00:46.827

I tried this with your example above, converted things to NumPy, and it worked:

import numpy as np

# to_numpy() replaces the deprecated as_matrix() (available since pandas 0.24)
train_label_np = host_s_labels_train.to_numpy().astype(np.int32)
train_data_np = host_s_data_train.to_numpy().astype(np.float32)
test_label_np = host_s_labels_test.to_numpy().astype(np.int32)
test_data_np = host_s_data_test.to_numpy().astype(np.float32)

cu_s_random_forest = cusRandomForestClassifier(n_estimators = 40,
                                           max_depth = 16, n_bins =16,
                                           max_features = 1.0,
                                           n_streams = 1)

# Fit data in RandomForest
cu_s_random_forest.fit(train_data_np,train_label_np)

# Predict data (GPU predict does not handle multi-class in 0.12; 0.13 warns and falls back to CPU)
predict_np = cu_s_random_forest.predict(test_data_np, predict_model='CPU')

# Check score
print('accuracy_score: ', sk_accuracy_score(test_label_np, predict_np))
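
When a score looks wrong, it can help to check which labels the predictor actually emits before comparing single accuracy numbers. Below is a minimal sketch in plain Python, assuming the labels and predictions have been copied to host lists of ints. The helper name `label_report` and the sample values are made up for illustration and do not come from either library.

```python
from collections import Counter


def label_report(y_true, y_pred):
    """Summarise accuracy and (true, predicted) label pairs.

    Hypothetical helper for illustration; expects non-empty plain Python
    sequences of equal length.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = Counter(zip(y_true, y_pred))
    correct = sum(count for (t, p), count in pairs.items() if t == p)
    missing = sorted(set(y_true) - set(y_pred))  # classes never predicted
    return correct / len(y_true), pairs, missing


# Made-up example: a predictor that never outputs class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 0]
accuracy, pairs, missing = label_report(y_true, y_pred)
print(accuracy)  # 4 of 6 correct
print(missing)   # [2]
```

A non-empty `missing` list points at a prediction-side problem rather than a poorly fitted model, which is the kind of symptom worth including in a bug report.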

BTW, this also works with cudf instead of numpy. I used cuml-0.13 — Vishal Mehta, Mar 12 '20 at 13:56
