I am new to cuDF and may not have understood the purpose of this construct, so this is a fairly generic question. I have a dataset that consists mostly of string columns, and I was hoping to use apply_rows to process the strings. However, I realized that this may only work with numeric data.

Here is an example that is quoted on most sites:

import cudf
import numpy as np

df = cudf.DataFrame()
nelem = 3
df['col1'] = np.arange(nelem)
df['col2'] = np.arange(nelem)
df['col3'] = np.arange(nelem)

# Define input columns for the kernel
col1 = df['col1']
col2 = df['col2']
col3 = df['col3']

def kernel(col1, col2, col3, out1, out2, kwarg1, kwarg2):
    for i, (x, y, z) in enumerate(zip(col1, col2, col3)):
        out1[i] = kwarg2 * x - kwarg1 * y
        out2[i] = y - kwarg1 * z

# Every output the kernel writes to must be declared in outcols
df.apply_rows(kernel,
              incols=['col1', 'col2', 'col3'],
              outcols=dict(out1=np.float64, out2=np.float64),
              kwargs=dict(kwarg1=3, kwarg2=4))

If I change this to

import cudf
import numpy as np

df = cudf.DataFrame()
nelem = 3
df['col1'] = np.arange(nelem)
df['col2'] = np.arange(nelem)
df['col3'] = ['a','a','a'] # <<- change to string

# Define input columns for the kernel
col1 = df['col1']
col2 = df['col2']
col3 = df['col3']

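def kernel(col1, col2, col3, out1, out2, kwarg1, kwarg2):
    for i, (x, y, z) in enumerate(zip(col1, col2, col3)):
        out1[i] = kwarg2 * x - kwarg1 * y
        out2[i] = y - kwarg1 * z

# Same call as before; col3 is now a string column
df.apply_rows(kernel,
              incols=['col1', 'col2', 'col3'],
              outcols=dict(out1=np.float64, out2=np.float64),
              kwargs=dict(kwarg1=3, kwarg2=4))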

It reports an error like AttributeError: 'nvstrings' object has no attribute 'to_gpu_array'.

Is this designed to work only with numerical values? I assume it is meant for matrix-type operations, which would explain the constraint. Can someone provide some insight here?

Mayukh

3 Answers

1

@Mayukh, as @rnyai said, apply_rows and UDFs don't work on string columns in RAPIDS the way you're using them. String processing is done a little differently: RAPIDS has a string accessor (.str) that uses nvstrings to process strings in a GPU-efficient way.

I'm not sure which operation you want to perform in the example from your question, but here is a link to our code for reference. The docs are linked further below.

https://github.com/rapidsai/cudf/blob/branch-0.14/python/cudf/cudf/core/column/string.py

For instance, if you wanted to make your strings uppercase:

import cudf
import numpy as np

df = cudf.DataFrame()
nelem = 3
df['col1'] = np.arange(nelem)
df['col2'] = np.arange(nelem)
df['col3'] = ['a','a','a'] # <<- change to string
df['col3'] = df['col3'].str.upper()
df.head()

There are more operations that you can do; see https://docs.rapids.ai/api/nvstrings/stable/

From there, you can write regular functions that process the strings with the expected GPU speedup. Just keep your code parallel: Python for loops are serial, whereas the RAPIDS string methods operate on the whole column at once and do a lot of the heavy lifting for you.

TaureanDyerNV
0

UDFs on string columns are not yet supported. You can follow the open GitHub issues here:

https://github.com/rapidsai/cudf/issues/2169

https://github.com/rapidsai/cudf/issues/3646

rnyai

Thanks @rnyai. So it does make sense to support numerical values as well? My question was to clarify my understanding of cuDF. So this can indeed be seen as a construct to parallelize data frame operations, regardless of the type of operation? E.g. https://github.com/nalepae/pandarallel parallelizes data frame operations too. How does it compare with cuDF? – Mayukh Mar 31 '20 at 13:25
0

When you call apply_rows, you are applying a user-defined function (UDF) to the rows of the columns you pass in. In the current version of cuDF, string columns are a very different type of object from numerical columns and, as @rnyai mentions, you cannot execute a UDF on a string column.
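To make the calling convention concrete, here is a CPU-only sketch that runs the kernel from the question on plain Python lists. It is not cuDF and does not reflect how cuDF is implemented: run_kernel_on_cpu is a hypothetical helper, and the lists merely stand in for numeric columns. It only illustrates that the kernel receives the input columns, preallocated output buffers, and the keyword arguments.

```python
# CPU-only toy model. Plain lists stand in for numeric GPU columns;
# nothing here uses or reproduces cuDF itself.

def kernel(col1, col2, col3, out1, out2, kwarg1, kwarg2):
    # Same body as the kernel in the question.
    for i, (x, y, z) in enumerate(zip(col1, col2, col3)):
        out1[i] = kwarg2 * x - kwarg1 * y
        out2[i] = y - kwarg1 * z


def run_kernel_on_cpu(columns, incols, outcols, kwargs):
    """Hypothetical helper that mimics the apply_rows calling convention."""
    nrows = len(columns[incols[0]])
    inputs = {name: columns[name] for name in incols}
    # Outputs are preallocated, one slot per row; the kernel fills them in place.
    outputs = {name: [0.0] * nrows for name in outcols}
    kernel(**inputs, **outputs, **kwargs)
    return outputs


columns = {"col1": [0, 1, 2], "col2": [0, 1, 2], "col3": [0, 1, 2]}
result = run_kernel_on_cpu(
    columns,
    incols=["col1", "col2", "col3"],
    outcols=["out1", "out2"],
    kwargs=dict(kwarg1=3, kwarg2=4),
)
print(result)
```

Note that even in this toy setting the kernel body would fail if col3 held strings, because y - kwarg1 * z is not defined for an integer and a string. The arithmetic itself assumes numeric inputs.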

Right now cuDF is undergoing a large transition (libcudf++) in which string columns are being re-architected, and they should soon support UDFs. Keep an eye on the issues mentioned by @rnyai to see when cuDF string columns will support UDFs.

In the meantime, I would suggest you only use apply_rows for your numerical columns and see if there is another way you can do what you need to do to your string columns. Perhaps if you post here what you are trying to achieve, we can suggest some solutions.
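One possible workaround, sketched below under assumptions, is to turn the strings into integer codes first and pass only the codes to a numeric kernel. This only makes sense if the string column holds a limited set of labels and the kernel needs nothing more than to tell them apart. encode_strings is a hypothetical pure-Python helper for illustration; pandas offers a comparable encoding through its categorical dtype (with codes assigned in sorted order rather than first-seen order), and whether your cuDF version mirrors that is something to check in its docs.

```python
def encode_strings(values):
    """Map each distinct string to a small integer code, in first-seen order.

    Hypothetical helper for illustration; not part of cuDF.
    """
    code_of = {}
    codes = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(code_of)
        codes.append(code_of[value])
    return codes, code_of


col3 = ["a", "b", "a"]
codes, code_of = encode_strings(col3)
print(codes)    # [0, 1, 0]
print(code_of)  # {'a': 0, 'b': 1}
```

The resulting codes are plain integers, so they could be stored in a numeric column and used as an input to apply_rows, with code_of kept around to translate results back to the original labels.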

bitsalsa