I don't think I'm trying to solve this as much as understand what's going on so I can apply it in the context of my larger project. I am working on rewriting a Python package to run on GPU.
Anyway, I am using cudf and cuml to pass a dataframe to a function that searches a column and builds an array of all the unique values. The data passed in is a CSV with all numeric fields, plus a class/y field containing 1 for the positive class and 0 otherwise.
The loading of the data is nearly identical.
EDIT: It appears the cause is that unique() on a cudf Series returns another Series instead of an array.
Still weird. The following works fine with a Pandas series object:
import pandas as pd

data = pd.read_csv(MY_DATA)
l = []
# Iterating a pandas Series yields its values one at a time
for val in data.MY_COLUMN:
    l.append(val)
print(type(data.MY_COLUMN))
# prints <class 'pandas.core.series.Series'>
But the same loop over a <cudf.core.series.Series> raises "TypeError: series objects not iterable". Why would that be?
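For reference, one way to make the loop above work with either library is to copy the column to host memory first and iterate over the copy. This is only a minimal sketch: host_values is a hypothetical helper name, and it assumes the GPU-backed Series exposes to_pandas() (cudf Series do), while anything without that method is treated as already being on the CPU.

```python
def host_values(series):
    """Return the values of a pandas-like or cudf-like Series as a Python list.

    Assumption: a GPU-backed Series exposes to_pandas(); objects without that
    method are treated as already living in host memory.
    """
    if hasattr(series, "to_pandas"):
        # One bulk device-to-host copy instead of element-by-element access
        series = series.to_pandas()
    return series.tolist()
```

With this, the loop would become `for val in host_values(data.MY_COLUMN): l.append(val)` in both the pandas and the cudf version.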
The pandas version:
import pandas as pd
from sklearn.model_selection import train_test_split
dataTest = pd.read_csv(MY_DATA)
X = dataTest.iloc[:, [1,12]]
y = dataTest.iloc[:,12]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.2, random_state=610)
The cudf version:
import cudf
from cuml.model_selection import train_test_split
dataTest = cudf.read_csv(MY_DATA)
X = dataTest.iloc[:, [1,12]]
y = dataTest.iloc[:,12]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.2, random_state=610)
Next, I pass X_train and y_train to a model that, as part of a million other things, creates an array of the unique values of y_train using pandas.unique(). With the pandas dataframe there are no issues and the entire model runs without a hitch (no big surprise, since it was built on pandas).
But, when I use the cudf dataframe, I get the following:
  File "/home/MASKED/preprocess.py", line 276, in _get_pos_class
    class_values = df[class_feat].unique()
  File "/home/jacob/miniconda3/envs/rapidsAI/lib/python3.7/site-packages/pandas/core/series.py", line 1872, in unique
    result = super().unique()
  File "/home/jacob/miniconda3/envs/rapidsAI/lib/python3.7/site-packages/pandas/core/base.py", line 1047, in unique
    result = unique1d(values)
  File "/home/jacob/miniconda3/envs/rapidsAI/lib/python3.7/site-packages/pandas/core/algorithms.py", line 407, in unique
    uniques = table.unique(values)
  File "pandas/_libs/hashtable_class_helper.pxi", line 4719, in pandas._libs.hashtable.PyObjectHashTable.unique
  File "pandas/_libs/hashtable_class_helper.pxi", line 4666, in pandas._libs.hashtable.PyObjectHashTable._unique
TypeError: unhashable type: 'Series'
I guess I don't understand why it sees an unhashable Series when accessing the cudf dataframe but not the pandas one. Aren't they both Series? I assumed that when the cudf dataframe was called with a method it did not possess, one of two things would happen:
- It would send the data back to the CPU to be processed by Pandas; or,
- It would break by not recognizing the method at all
But what seems to be happening is that it knows the cudf dataframe is a dataframe and tries to process it with pandas, and then pandas sees something different.
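To make the first of those two expectations concrete, here is a rough sketch of what an explicit "send it back to the CPU" step could look like around the failing call. The names df and class_feat come from the traceback; unique_class_values is a hypothetical helper, and the sketch assumes that unique() returns either a host array (as pandas does) or a GPU Series with a to_pandas() method (as cudf does).

```python
def unique_class_values(df, class_feat):
    """Return the unique values of df[class_feat] as a plain Python list.

    Assumption: unique() gives back either a host array with tolist()
    (pandas) or a GPU-backed Series that exposes to_pandas() (cudf).
    """
    uniques = df[class_feat].unique()
    if hasattr(uniques, "to_pandas"):
        # Explicitly copy the (small) result from the GPU to the CPU
        uniques = uniques.to_pandas()
    return uniques.tolist()
```

The point of the sketch is only that the copy to the CPU is requested explicitly at the boundary, rather than expected to happen automatically when pandas code receives a cudf object.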
I'd like to understand why that is, as I am sure I will be running into a lot more of this.
Thanks for any insight.