I want to pad each list produced by collecting values in a groupby operation.

The conceptual implementation is like this:

df = cudf.DataFrame({"g": [1, 1, 1, 2, 2, 3], "a": [1, 2, 3, 1, 3, 1]})
df.groupby("g")["a"].collect().list.pad(max_length=3, pad_left=True, drop="last", padding_value=-1)

expected output:

g
1      [1, 2, 3]
2     [-1, 1, 3]
3    [-1, -1, 1]

How can I do this?

Converting to a pandas DataFrame and applying `np.pad` works, but it seems awkward and slow. Is there a way to do this in cuDF/CuPy?

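import cudf
import numpy as np

# Round trip through pandas: left-pad each collected list to length 3 with -1
cudf.from_pandas(
    df.groupby("g")["a"]
    .collect()
    .to_pandas()
    .apply(lambda x: np.pad(x, (max(3 - len(x), 0), 0), constant_values=(-1,)))
)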

Note: calling `.apply()` directly on a cuDF Series of list type raises `NumbaNotImplementedError`:

df = cudf.DataFrame({"g": [1, 1, 1, 2, 2, 3], "a": [1, 2, 3, 1, 3, 1]})
df.groupby("g")["a"].collect().apply(
    lambda x: np.pad(x, (max(3 - len(x), 0), 0), constant_values=(-1,))
)

NumbaNotImplementedError: list
bilzard

1 Answer

This question was answered in the RAPIDS cuDF GitHub repo; I'm just closing the loop here.

Nick Becker shared a link to NVTabular, which the OP was using. It demonstrates how to pad a list column with the `pad` and `pad_value` parameters of `nvtabular.ops.ListSlice`.
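
For anyone who would rather stay at the array level than pull in NVTabular, the padding itself can be sketched as a vectorized operation on the flattened values plus the per-group list lengths. This is only a rough sketch, not an existing cuDF API: `pad_left` is a hypothetical helper, it is written against NumPy so it runs anywhere, and I am assuming the same calls carry over to CuPy by swapping the import. I am also assuming that `drop="last"` in the question means "keep the first `max_length` elements". In cuDF, the two inputs would presumably come from `.list.leaves` and `.list.len()` on the collected Series, but I have not verified that end to end.

```python
import numpy as np  # assumption: `import cupy as np` should work the same way on the GPU


def pad_left(values, lengths, max_length, padding_value=-1):
    """Hypothetical helper: left-pad variable-length lists stored as flat values + lengths.

    Lists longer than max_length keep their first max_length elements
    (the assumed meaning of drop="last").
    """
    values = np.asarray(values)
    lengths = np.asarray(lengths)

    # Start offset of each list inside the flat values array.
    starts = np.cumsum(lengths) - lengths
    # Number of elements actually kept from each list.
    kept = np.minimum(lengths, max_length)

    out = np.full((len(lengths), max_length), padding_value, dtype=values.dtype)

    # For output column j of row i, the position inside list i is
    # j - (max_length - kept[i]); negative positions are padding slots.
    pos = np.arange(max_length)[None, :] - (max_length - kept)[:, None]
    mask = pos >= 0
    src = starts[:, None] + pos

    out[mask] = values[src[mask]]
    return out


# Same data as the question: groups 1, 2, 3 hold [1, 2, 3], [1, 3], [1].
padded = pad_left([1, 2, 3, 1, 3, 1], [3, 2, 1], max_length=3)
print(padded.tolist())  # [[1, 2, 3], [-1, 1, 3], [-1, -1, 1]]
```

The result is a dense 2-D array with one row per group, so turning it back into a list column (if that is what you need) is a separate step.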

TaureanDyerNV
  • True, this problem is better suited to another preprocessing package like NVTabular. However, can I keep this post in case someone wants to do the same processing with cuDF? – bilzard Jan 11 '23 at 05:02
  • yeah keep it, because i hope this will direct them properly :) – TaureanDyerNV Jan 23 '23 at 23:13