
I don't think I'm trying to solve this as much as understand what's going on so I can apply it in the context of my larger project. I am working on rewriting a Python package to run on GPU.

Anyway, I am using cudf and cuml to pass a dataframe to a function that searches a column and makes an array of all the unique values. The data passed in is a CSV with all numeric fields, and a class/y field containing 1 for the positive class and 0 otherwise.

The loading of the data is nearly identical.

EDIT: It appears that the fact that cudf.unique() returns a Series instead of an array is the cause.

Still weird. The following works fine with a Pandas series object:

data = pd.read_csv(MY_DATA)

l = []

for val in data.MY_COLUMN:
    l.append(val)

print(type(data.MY_COLUMN))
# prints <class 'pandas.core.series.Series'>

But the same loop over a <cudf.core.series.Series> fails with a TypeError saying Series objects are not iterable. Why would that be?

The pandas version:

import pandas as pd
from sklearn.model_selection import train_test_split

dataTest = pd.read_csv(MY_DATA)

X = dataTest.iloc[:, [1,12]]
y = dataTest.iloc[:,12]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.2, random_state=610)

The cudf version:

import cudf
from cuml.model_selection import train_test_split

dataTest = cudf.read_csv(MY_DATA)

X = dataTest.iloc[:, [1,12]]
y = dataTest.iloc[:,12]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.2, random_state=610)

Next, I pass X_train and y_train to a model that, as part of a million other things, creates an array of the unique values of y_train. As part of that process it uses pandas.unique(). With the pandas dataframe there are no issues and the entire model runs without a hitch (no big surprise, since it was built on pandas).

But, when I use the cudf dataframe, I get the following:

File "/home/MASKED/preprocess.py", line 276, in _get_pos_class
    class_values = df[class_feat].unique()
  File "/home/jacob/miniconda3/envs/rapidsAI/lib/python3.7/site-packages/pandas/core/series.py", line 1872, in unique
    result = super().unique()
  File "/home/jacob/miniconda3/envs/rapidsAI/lib/python3.7/site-packages/pandas/core/base.py", line 1047, in unique
    result = unique1d(values)
  File "/home/jacob/miniconda3/envs/rapidsAI/lib/python3.7/site-packages/pandas/core/algorithms.py", line 407, in unique
    uniques = table.unique(values)
  File "pandas/_libs/hashtable_class_helper.pxi", line 4719, in pandas._libs.hashtable.PyObjectHashTable.unique
  File "pandas/_libs/hashtable_class_helper.pxi", line 4666, in pandas._libs.hashtable.PyObjectHashTable._unique
TypeError: unhashable type: 'Series'

I guess I don't understand why it sees an unhashable Series when accessing the cudf dataframe but not the pandas one; aren't they both Series? I assumed that when the cudf dataframe was called with a method it did not possess, one of two things would happen:

  • It would send the data back to the CPU to be processed by Pandas; or,
  • It would break by not recognizing the method at all

But what seems to be happening is that it knows the cudf dataframe is a dataframe and tries to process it with pandas, but then pandas sees something different.
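One possible reading of the traceback, sketched in plain Python so it runs without a GPU: the frame `PyObjectHashTable.unique` means pandas is hashing generic Python objects, and a hash-based unique fails as soon as one element has no hash. Everything named here is hypothetical: `FakeDeviceSeries` stands in for whatever object ended up inside the column, and `hash_unique` stands in for the pandas hash table. Whether cudf disables hashing in exactly this way is an assumption.

```python
class FakeDeviceSeries:
    """Hypothetical stand-in for a Series-like object stored as one element."""

    def __init__(self, values):
        self.values = list(values)

    # Defining __eq__ without __hash__ makes Python set __hash__ to None,
    # so instances of this class are unhashable.
    def __eq__(self, other):
        return isinstance(other, FakeDeviceSeries) and self.values == other.values


def hash_unique(values):
    """Hypothetical hash-based unique: every element must be hashable."""
    # dict.fromkeys hashes each element and keeps first-seen order.
    return list(dict.fromkeys(values))


# Plain numbers hash fine, so this works like unique() on a numeric column.
print(hash_unique([1, 0, 1, 1, 0]))

# A column whose elements are Series-like objects cannot be hashed.
try:
    hash_unique([FakeDeviceSeries([1, 0]), FakeDeviceSeries([1, 0])])
except TypeError as err:
    print("TypeError:", err)
```

If this reading applies, the thing to check is where the column being passed to unique() stopped holding numbers and started holding whole Series objects.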

I'd like to understand why that is, as I am sure I will be running into a lot more of this.

Thanks for any insight.

datahappy
  • Hard to know what's happening without a minimal example, but this may be partially explained by a subtle bug in the cuDF codebase. cuDF's unique is returning a Series, while pandas returns an array. I wouldn't necessarily expect this to be the root cause, though. – Nick Becker May 06 '21 at 16:05
  • That appears to be the issue. In every case where it breaks, the return from unique() is being iterated over. I'm still a bit confused, because in some cases it shouldn't matter that it is a series. Adding to above question to explain. – datahappy May 06 '21 at 18:04
  • cudf series are explicitly not iterable as the performance of direct iteration would be poor. With that said, this is an antipattern in pandas, too. There is usually a way to express the iteration as a set of columnar operations or by using a UDF with Numba. If not, you can call `to_pandas()` to go back to pandas for the iteration – Nick Becker May 06 '21 at 19:35
  • Would it be preferable to go back to pandas vs converting the series output to a cupy array? (Generally, I know you don't know the specifics of my use case) – datahappy May 07 '21 at 12:06
  • In general, you should avoid explicit Python iteration over a GPU series or array unless you are doing it with a Numba.cuda or CuPy kernel. Your original error suggests there may be a Series inside the Series on which you call unique, which is generally an antipattern. If so, there might be a more columnar approach that removes both the need to iterate explicitly and the original error. – Nick Becker May 07 '21 at 13:01
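To illustrate the point in the comments that a class can refuse iteration on purpose, here is a minimal sketch in plain Python. `HostColumn`, `DeviceColumn`, `to_host` and `collect` are hypothetical names for illustration only, not the cudf API; in cudf the analogous explicit step mentioned above is `to_pandas()`.

```python
class HostColumn:
    """Hypothetical CPU-backed column: element-wise iteration is allowed."""

    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        return iter(self._values)


class DeviceColumn:
    """Hypothetical GPU-backed column: iteration is refused deliberately."""

    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        # Raising here turns a silently slow loop into an explicit error.
        raise TypeError("DeviceColumn object is not iterable; call to_host() first")

    def to_host(self):
        # The copy back to the CPU is an explicit, visible step.
        return HostColumn(self._values)


def collect(column):
    """The loop from the question: append every value to a list."""
    out = []
    for val in column:
        out.append(val)
    return out


print(collect(HostColumn([1, 0, 1])))              # the loop works on the host column
try:
    collect(DeviceColumn([1, 0, 1]))               # the same loop is rejected
except TypeError as err:
    print("TypeError:", err)
print(collect(DeviceColumn([1, 0, 1]).to_host()))  # works after explicit conversion
```

The design choice being sketched is that the same loop is accepted or rejected depending only on whether the container opts in to iteration, so two classes that are both called Series can still behave differently here.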
