How do you do a grid search with cuml without a datatype error?

Question

I tried doing a grid search with cuML (RAPIDS 21.10) and I get a conversion error (the message mentions cupy). This doesn't happen if I build the model with the same dataset without a grid search. It also works when the data is in host memory instead of GPU memory, but it is then slower than running on the CPU. The data is float32 for X and int32 for y:

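# Imports assumed from the aliases used below; they are not shown in the original post.
import cudf
from cuml.ensemble import RandomForestClassifier as RandomForestClassifier_cu
from sklearn.model_selection import GridSearchCV as GridSearchCV_cu

# X_train, X_test, y_train are pandas objects and grid_RF is the parameter grid (definitions not shown).
X_cudf_train = cudf.DataFrame.from_pandas(X_train)
X_cudf_test = cudf.DataFrame.from_pandas(X_test)

y_cudf_train = cudf.Series(y_train.values)

RF_classifier_cu = RandomForestClassifier_cu(random_state=123)
grid_search_RF_cu = GridSearchCV_cu(estimator=RF_classifier_cu, param_grid=grid_RF, cv=3, verbose=1)
grid_search_RF_cu.fit(X_cudf_train, y_cudf_train)
print(grid_search_RF_cu.best_params_)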

The error:

/home/asdanjer/miniconda3/envs/rapids-21.10/lib/python3.8/site-packages/cuml/internals/api_decorators.py:794: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_state is set
  return func(**kwargs)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>

~/miniconda3/envs/rapids-21.10/lib/python3.8/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    800         fit_params = _check_fit_params(X, fit_params)
    801 
--> 802         cv_orig = check_cv(self.cv, y, classifier=is_classifier(estimator))
    803         n_splits = cv_orig.get_n_splits(X, y, groups)
    804 

~/miniconda3/envs/rapids-21.10/lib/python3.8/site-packages/sklearn/model_selection/_split.py in check_cv(cv, y, classifier)
   2301             classifier
   2302             and (y is not None)
-> 2303             and (type_of_target(y) in ("binary", "multiclass"))
   2304         ):
   2305             return StratifiedKFold(cv)

~/miniconda3/envs/rapids-21.10/lib/python3.8/site-packages/sklearn/utils/multiclass.py in type_of_target(y)
    277         raise ValueError("y cannot be class 'SparseSeries' or 'SparseArray'")
    278 
--> 279     if is_multilabel(y):
    280         return "multilabel-indicator"
    281 

~/miniconda3/envs/rapids-21.10/lib/python3.8/site-packages/sklearn/utils/multiclass.py in is_multilabel(y)
    149             warnings.simplefilter("error", np.VisibleDeprecationWarning)
    150             try:
--> 151                 y = np.asarray(y)
    152             except np.VisibleDeprecationWarning:
    153                 # dtype=object should be provided explicitly for ragged arrays,

~/miniconda3/envs/rapids-21.10/lib/python3.8/site-packages/cudf/core/frame.py in __array__(self, dtype)
   1636 
   1637     def __array__(self, dtype=None):
-> 1638         raise TypeError(
   1639             "Implicit conversion to a host NumPy array via __array__ is not "
   1640             "allowed, To explicitly construct a GPU array, consider using "

TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU array, consider using cupy.asarray(...)
To explicitly construct a host array, consider using .to_array()

Currently, the data must be on the CPU to use scikit-learn based hyperparameter optimization tools like GridSearchCV with cuML estimators. The required device/host transfers are likely quite fast relative to your training/predict calls, so this may not be an issue. https://rapids.ai/hpo provides a list of HPO examples using other tools, many of which support using data on the GPU. — Nick Becker, Nov 28 '21 at 18:13
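A minimal sketch of the host-transfer approach from the comment above, not an official cuML recipe. It assumes cuDF's `to_pandas()` for the device-to-host copy. The synthetic data, the contents of `grid_RF`, and the helper name `to_host` are made up for illustration. When RAPIDS is not installed, it falls back to scikit-learn's random forest so the sketch can still be run.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV

try:
    import cudf
    import pandas as pd
    from cuml.ensemble import RandomForestClassifier
    HAVE_RAPIDS = True
except ImportError:
    # Fallback (assumption for illustration) so the sketch runs without a GPU.
    from sklearn.ensemble import RandomForestClassifier
    HAVE_RAPIDS = False


def to_host(obj):
    """Hypothetical helper: copy a cuDF object to a NumPy array on the host."""
    if HAVE_RAPIDS and isinstance(obj, (cudf.DataFrame, cudf.Series)):
        return obj.to_pandas().to_numpy()
    return np.asarray(obj)


# Synthetic stand-in for the poster's data: float32 features, int32 labels.
rng = np.random.default_rng(0)
X = rng.random((200, 10), dtype=np.float32)
y = (X[:, 0] > 0.5).astype(np.int32)

if HAVE_RAPIDS:
    # Mimic the situation in the question: data already lives on the GPU.
    X_dev = cudf.DataFrame.from_pandas(pd.DataFrame(X))
    y_dev = cudf.Series(y)
else:
    X_dev, y_dev = X, y

# Explicit device-to-host copy before handing the data to scikit-learn,
# so that np.asarray(y) inside check_cv never sees a cuDF object.
X_host = to_host(X_dev)
y_host = to_host(y_dev)

grid_RF = {"n_estimators": [10, 20], "max_depth": [4, 8]}  # illustrative grid
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=123),
    param_grid=grid_RF,
    cv=3,
)
grid_search.fit(X_host, y_host)
print(grid_search.best_params_)
```

With cuML installed, the estimator still trains on the GPU; each `fit` call copies its fold of the host array back to the device, which is the transfer cost the comment refers to.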
Well, at least with the current synthetic dataset (1000 data points, 300 variables and 7128 fits) I get a better time on the CPU (3 min vs 5 min). I'm going to try other tools again. Thanks! — ARandomeUser, Nov 28 '21 at 22:19
With only 1000 rows, you aren't likely to see a benefit. Since you're already using scikit-learn, the easiest way to do faster HPO would be to use random search or Bayesian search rather than grid search. — Nick Becker, Nov 29 '21 at 02:16
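The random-search suggestion could look roughly like the sketch below, which uses scikit-learn's `RandomizedSearchCV` on host data. The parameter values, `n_iter=5`, and the synthetic dataset are assumptions chosen for illustration, and the scikit-learn fallback is only there so the example runs without RAPIDS.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV

try:
    from cuml.ensemble import RandomForestClassifier
except ImportError:
    # Fallback (assumption for illustration) so the sketch runs without a GPU.
    from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset on the host, cast to the dtypes from the question.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X = X.astype(np.float32)
y = y.astype(np.int32)

# Illustrative search space: 3 * 3 * 3 = 27 combinations in total.
param_distributions = {
    "n_estimators": [10, 20, 40],
    "max_depth": [4, 8, 12],
    "max_features": [0.3, 0.5, 1.0],
}

# Sample only 5 of the 27 combinations: 5 * 3 folds = 15 fits,
# compared with 27 * 3 = 81 fits for an exhaustive grid search.
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=123),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
random_search.fit(X, y)
print(random_search.best_params_)
```

The number of fits is controlled by `n_iter` rather than by the size of the search space, so the run time stays bounded as more parameters are added.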
