
As per the title, I am trying to combine the row values from different cudf.DataFrame columns. The following code works for a standard pandas.DataFrame:

import pandas as pd
data = {'a': [1], 'b': [2], 'c': [3], 'd': [4]}
df = pd.DataFrame.from_dict(data)

def f(row):
    return {'dictfromcolumns': [row['a'], row['b'], row['c'], row['d']]}

df['new'] = df.apply(f, axis=1)

I expected the equivalent code with cudf to look like this:

import cudf

dfgpu = cudf.DataFrame(df)
dfgpu['new'] = dfgpu.apply(f, axis=1)

But this will throw the following ValueError exception:

ValueError: user defined function compilation failed.

Is there an alternative way to combine cudf columns? In my case I need to create a dict and store it as the value in a new column.

Thanks!

epifanio

1 Answer


pandas allows storing arbitrary data structures inside columns (such as a dictionary of lists, in your case). cuDF does not. However, cuDF provides an explicit data type called struct, which is common in big data processing engines and may be what you want in this case.

Your UDF is failing because numba.cuda, which compiles the function for the GPU, doesn't understand the dictionary/list data structures.

The best way to do this is to first collect your data into a single column as a list (cuDF also provides an explicit list data type). You can do this by melting your data from wide to long (and adding a key column to keep track of the original rows) and then doing a groupby collect operation. Then, create the struct column.

import pandas as pd
import cudf
import numpy as np

data = {'a': [1, 10], 'b': [2, 11], 'c': [3, 12], 'd': [4, 13]}
df = pd.DataFrame.from_dict(data)

gdf = cudf.from_pandas(df)
gdf["key"] = np.arange(len(gdf))

melted = gdf.melt(id_vars=["key"], value_name="struct_key_name") # wide to long format
gdf["new"] = melted.groupby("key").collect()[["struct_key_name"]].to_struct()
gdf
    a   b   c   d  key                                    new
0   1   2   3   4    0      {'struct_key_name': [1, 4, 2, 3]}
1  10  11  12  13    1  {'struct_key_name': [10, 13, 11, 12]}
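
To see what the melt and groupby steps produce at each stage, here is a minimal CPU-only sketch of the same reshaping in pandas. It is an illustration, not the cuDF code path: the column name `value` and the use of `agg(list)` in place of cuDF's `collect()` are assumptions made for this example, and the final column holds plain Python dicts rather than a struct.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 10], 'b': [2, 11], 'c': [3, 12], 'd': [4, 13]})
df["key"] = range(len(df))  # row identifier that survives the reshape

# Wide to long: one row per (key, original column) pair.
melted = df.melt(id_vars=["key"], value_name="value")

# Long back to one row per key, with that row's values gathered into a list.
collected = melted.groupby("key")["value"].agg(list)

# Wrap each list in a dict to mimic the single-field struct (pandas only).
df["new"] = [{"value": vals} for vals in collected]
```

In pandas the values come out in column order. The cuDF output above shows a different order within each list, so it is probably safest not to rely on element order without sorting.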

Note that the struct column in cuDF is not the same as "a dictionary in a column". It's a much more efficient, explicit type meant for storing and manipulating columnar {key : value} data. cuDF provides a "struct accessor" to manipulate structs, which you can access at df[col].struct.XXX. It currently supports selecting individual fields (keys) and the explode operation. You can also carry structs around in other operations (including I/O).
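
As a rough mental model of that difference (a conceptual sketch in plain Python, not how cuDF is implemented internally), a column of dictionaries stores one independent object per row, while a struct-like layout stores one child column per field. The names `rows_of_dicts`, `struct_column`, `select_field`, and `row_at` are made up for illustration.

```python
# "Dictionary in a column": one arbitrary Python object per row.
rows_of_dicts = [
    {"x": 1, "y": 2.5},
    {"x": 10, "y": 3.5},
]

# Struct-like layout: one homogeneous child column per field.
struct_column = {
    "x": [1, 10],
    "y": [2.5, 3.5],
}

def select_field(column, name):
    """Return a whole field as a column, with no per-row dictionary lookups."""
    return column[name]

def row_at(column, i):
    """Rebuild a single row as a dict from the columnar layout."""
    return {name: values[i] for name, values in column.items()}

xs_from_rows = [row["x"] for row in rows_of_dicts]  # visits every row object
xs_from_struct = select_field(struct_column, "x")   # a single lookup
```

Selecting a field from the columnar layout returns an existing column directly, which is roughly why field selection on a struct column can be cheap compared with looping over per-row dictionaries.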

Nick Becker
  • Building on this question, how can I run `gdf.groupby("column").agg({'field': list})` when the field is a struct column? When I tried to aggregate on the "new" field, I got the following exception: `DataError: All requested aggregations are unsupported.` Is there an alternative way to do the same? I can open a new SO question if needed. – epifanio Sep 26 '22 at 07:54