PyCUDA 2D array implementations (or working with strings)

Question

I'm trying to work with an array of strings(words) in CUDA.

I tried flattening it by creating a single string, but then then to index it, I'd have to go through some of it each time a kernel runs. If there are 9000 words with a length of 6 characters, I'd have to examine 53994 characters in the worst case for each kernel call. So I'm looking for different ways to do it.

Update: Forgot to mention, the strings are of different lengths, so I'd have to find the end of each one.

The next thing I tried was copying each word to different memory locations, and then collect the addresses, and pass that to the GPU as an array with the following code:

# np = numpy

wordList = ['asd','bsd','csd']

d_words = []

for word in wordList:
    d_words.append(gpuarray.to_gpu(np.array(word, dtype=str)))

d_wordList = gpuarray.to_gpu(np.array([word.ptr for word in d_words], dtype=np.int32))

ker_test(d_wordList, block=(1,1,1), grid=(1,1,1))

and in the kernel:

__global__ void test(char** d_wordList) {
    printf("First character of the first word is: %c \n", d_wordList[0][0]);
}

The kernel should get an int32 array of pointers that point to the beginning of each word, effectively being a char** (or int**), but it doesn't work as I expect.

What is wrong with this approach?

Also what are the "standard" ways to work with strings in PyCUDA (or even in CUDA) in general?

Thanks in advance.

it's not at all clear why flattening the strings would cause algorithmic difficulty. I would call that a "standard" method. It's also not clear why you think an array of pointers would fit in a int32 array. Are you working on a 32-bit OS? — Robert Crovella, Jan 02 '18 at 17:12
Forgot to mention that the strings weren't the same length, so I'd have to examine each character if was the end of a string or not for each "index". This looked tedious both to write and probably to compute as well. So the second approach would be somewhat easier to work with, I don't know about the performance difference it would make though. Ah, the int32 could be an issue, for some reason thought it would be enough. — tamasfe, Jan 02 '18 at 17:16
In addition to passing a packed array of the strings, pass a packed array of the starting **offsets** (not pointers) of each string. It should be easy to assemble that as you are packing the strings, and it should be easy to pass (as a int32 array) and it should be easy (and more efficient) to use that to do pointer-based access within the CUDA kernel itself, rather than using the double-pointer method you are discussing here. The length of each string is just the difference between one index/offset and the next. — Robert Crovella, Jan 02 '18 at 17:22
Thank you! The pointer size was the problem indeed, I have no idea why I thought memory addresses would be 32bits. Also passing the array offsets idea looks pretty handy, and is somehing I should consider as well. So now I have two options to choose from with all problems solved. I wouldn't want to delete the question though in case someone struggles with similar problems, so if you post an answer, then I can "close" it properly. — tamasfe, Jan 02 '18 at 17:56
Also, the offset array method appears to be slightly slower in my implementation, so I'll probably stick to the 2D array one, I don't know the disadvantages of it though yet. — tamasfe, Jan 02 '18 at 19:06
To be correct you should be using `numpy.uintp` as the dtype for pointers. Internally that is how PyCUDA does it — talonmies, Jan 02 '18 at 19:48

score 1 · Accepted Answer · answered Jan 03 '18 at 00:24

After some further thought, I've concluded that for this case of variable-length strings, using an "offset array" may not be much different than 2D indexing (i.e. double-pointer indexing), when considering the issue of data access within the kernel. Both involve a level of indirection.

Here's a worked example demonstrating both methods:

$ cat t5.py
#!python
#!/usr/bin/env python
import time
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import math
from sys import getsizeof

import pycuda.autoinit

kernel_code1 = """
__global__ void test1(char** d_wordList) {
      (d_wordList[blockIdx.x][threadIdx.x])++;
}
    """

kernel_code2 = """
__global__ void test2(char* d_wordList, size_t *offsets) {
    (d_wordList[offsets[blockIdx.x] + threadIdx.x])++;
}
    """




mod = compiler.SourceModule(kernel_code1)
ker_test1 = mod.get_function("test1")



wordList = ['asd','bsd','csd']

d_words = []

for word in wordList:
    d_words.append(gpuarray.to_gpu(np.array(word, dtype=str)))

d_wordList = gpuarray.to_gpu(np.array([word.ptr for word in d_words], dtype=np.uintp))

ker_test1(d_wordList, block=(3,1,1), grid=(3,1,1))

for word in d_words:
  result = word.get()
  print result

mod2 = compiler.SourceModule(kernel_code2)
ker_test2 = mod2.get_function("test2")
wordlist2 = np.array(['asdbsdcsd'], dtype=str)
d_words2 = gpuarray.to_gpu(np.array(['asdbsdcsd'], dtype=str))
offsets = gpuarray.to_gpu(np.array([0,3,6,9], dtype=np.uint64))
ker_test2(d_words2, offsets, block=(3,1,1), grid=(3,1,1))
h_words2 = d_words2.get()
print h_words2


$ python t5.py
bte
cte
dte
['btectedte']
$

Notes:

for the double-pointer case, the only change from OP's example was to use the numpy.uintp type for the pointer as suggested in the comments by @talonmies
I don't think the double-pointer access of data will necessarily be quicker or slower than the indirection associated with the offset lookup method. One other performance consideration would be in the area of copying data from host to device and vice versa. The double pointer method effectively involves multiple allocations and multiple copy operations, in both directions, I believe. For a lot of strings, this will be noticeable in the host/device data copy operations.
Another possible merit of the offset method is that it is easy to determine the length of each string - just subtract two adjacent entries in the offset list. This could be useful so as to make it easy to determine how many threads can operate on a string in parallel, as opposed to having a single thread work on a string sequentially (or use a method in kernel code to determine string length, or pass the length of each string).

Thanks again for the detailed answer. After some testing I found out that the offset array method is the way to go indeed. The char** method becomes unusably slow when working with large data due to the memory operations. (At least the way I manged to implement it) — tamasfe, Jan 03 '18 at 16:55

PyCUDA 2D array implementations (or working with strings)

1 Answers1