PyCUDA misaligned address cleanup failure

Question

I am very new to CUDA programming and am starting with PyCUDA to learn the basics. I studied the tutorials and ran a couple of simple tests, all of which used only 1D arrays. When I run the following code with a 2D array, I keep getting a PyCUDA warning saying that a clean-up operation failed due to a misaligned address.

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

mod = SourceModule("""
    __global__ void multiply(float **dest) {
        const int i = threadIdx.x;
        const int j = threadIdx.y;
        dest[i][j] = 2.0*dest[i][j];
    }
""")

a = np.random.randn(32, 32).astype(np.float32)
multiply = mod.get_function("multiply")
multiply(drv.InOut(a), block=(32,32,1), grid=(1,1))

print(a)

The error that I get when I run the above script is:

Traceback (most recent call last):
  File "cudaTest.py", line 16, in <module>
    multiply(drv.InOut(a), block=(32,32,1), grid=(1,1))
  File "/users/gpu/local/python3.3/lib/python3.6/site-packages/pycuda-2016.1.2-py3.6-linux-x86_64.egg/pycuda/driver.py", line 405, in function_call
    Context.synchronize()
pycuda._driver.LogicError: cuCtxSynchronize failed: misaligned address
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: misaligned address
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: misaligned address

I have checked other questions on SO and found a similar one. Following the answer given there, I tried specifying the size of the array a, but to no avail.

I am running this on a cluster with two NVIDIA Tesla K10 GPUs. Since I have no root access, I had to install Python 3 locally and add numpy, PyCUDA, etc. to the local installation. The cluster runs Ubuntu 12.04.1 LTS. I am using Python 3.6.0 with PyCUDA 2016.1.2 and CUDA 6.0.

talonmies · Accepted Answer · 2017-01-03T09:05:41.290

The problem here is that your understanding of what constitutes a "2D array" is incorrect. Numpy arrays (and by extension PyCUDA gpuarrays) are stored in pitched linear memory in row major order by default. Your kernel has been written to expect an array of pointers as an input, so it treats the floating point data as addresses and tries to dereference them, leading to the runtime addressing error you are seeing.
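To make the layout concrete, here is a minimal host-side sketch using only NumPy. The array contents and the names flat, lda and as_pointers are illustrative and do not come from the question; the reinterpretation step assumes 64-bit device pointers.

```python
import numpy as np

# A small 3x4 float32 array; by default NumPy stores it C-contiguous (row major).
a = np.arange(12, dtype=np.float32).reshape(3, 4)
assert a.flags["C_CONTIGUOUS"]

# The "2D" array is one linear buffer: ravel() returns a view of it, not a copy.
flat = a.ravel()
assert np.shares_memory(flat, a)

# Row pitch in elements: bytes between consecutive rows / bytes per element.
lda = a.strides[0] // a.itemsize

# Element (i, j) lives at flat index i * lda + j.
i, j = 2, 1
assert flat[i * lda + j] == a[i, j]

# Roughly what a float** kernel would see: the same bytes reinterpreted as
# 64-bit integers (assumption: 8-byte pointers). These are float bit patterns,
# not addresses returned by any allocator.
as_pointers = flat.view(np.uint64)
print(hex(int(as_pointers[0])))
```

Nothing in the buffer is a row pointer, which is why indexing it as dest[i][j] on the device ends up dereferencing arbitrary bit patterns.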

To correct your kernel to work with the array, you would need to modify it to something like:

mod = SourceModule("""
    __global__ void multiply(float *dest, int lda) {
        const int i = threadIdx.x;
        const int j = threadIdx.y;
        float *p = &dest[i * lda + j]; // row major order
        *p *= 2.0f;
    }
""")

Note that the array is passed as a pointer to a pitched linear allocation, not an array of row pointers. Because of this, you also need to pass the pitch of the array in elements (the row length, lda here) to the kernel, so that the calling PyCUDA host code looks like:

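N = 8  # the whole array is handled by one N x N block, so N must not exceed 32
a = np.random.randn(N, N).astype(np.float32)
print(a)
multiply = mod.get_function("multiply")
lda = np.int32(N)  # row pitch in elements, as a sized numpy scalar to match the kernel's int
multiply(drv.InOut(a), lda, block=(N,N,1), grid=(1,1))
print(a)  # same array, every element doubled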

You should find this will now work correctly.
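One way to check the result without trusting the GPU output is to emulate the kernel's indexing on the CPU. The sketch below is an illustration only: row_pitch_in_elements and multiply_reference are hypothetical helper names, not part of PyCUDA, and the pitch is derived from the array's strides instead of being hard-coded as N.

```python
import numpy as np

def row_pitch_in_elements(arr):
    """Hypothetical helper: row pitch of a C-contiguous 2D array, in elements."""
    if arr.ndim != 2 or not arr.flags["C_CONTIGUOUS"]:
        raise ValueError("expected a C-contiguous 2D array")
    # np.int32 so the value could be passed straight to a kernel taking `int`.
    return np.int32(arr.strides[0] // arr.itemsize)

def multiply_reference(arr, lda):
    """CPU emulation of the corrected kernel: one (i, j) pair per 'thread'."""
    out = arr.copy()           # leave the input untouched
    flat = out.reshape(-1)     # linear view of the copy, like the device pointer
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            flat[i * lda + j] *= 2.0   # same indexing as dest[i * lda + j]
    return out

a = np.arange(6, dtype=np.float32).reshape(2, 3)
lda = row_pitch_in_elements(a)
expected = multiply_reference(a, lda)
print(expected)
```

With the host code above, you could take a copy of a before the kernel launch, compute the reference from that copy, and compare it to the GPU result with np.allclose.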
