I am working with CUDA and 3D textures in python (using pycuda). There is a function called Memcpy3D which has the same members as Memcpy2D plus a few extras. In it it calls you to describe things such as width_in_bytes
, src_pitch
, src_height
, height
and copy_depth
. This is what I am struggling with (in 3D) and its relevance with C or F style indexing. For instance, if I simply change the ordering from F to C in the working example below, it stops working - and I don't know why.
- First of all, I understand pitch to be how many bytes in memory it takes to move one index across in
threadIdx.x
(or the x direction, or a column). So for a float32 array of C shape (3,2,4), to move one value in x I expect to move 4 values in memory (as the indexing goes down the z axis first?). Therefore my pitch would be 4*32bits. - I understand
height
to be the number of rows. (In this example, 3) - I understand
width
to be the number of cols. (In this example, 2) - I understand
depth
to be the number of z slices. (In this example, 4) - I understand
width_in_bytes
to be the width of a row in x inclusive of the z elements behind it, i.e. a row slice, (0,:,:). This would be how many addresses in memory it takes to transverse one element in the y-direction.
So when I change the ordering from F to C in the code below, and adapt the code to change the height/width values accordingly it still doesn't work. It just presents a logic failure which makes me think I'm not understanding the concept of pitch, width, height, depth correctly.
Please educate me.
Below is a full working script that copies an array to the GPU as a texture and copies the contents back.
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
w = 2
h = 3
d = 4
shape = (w, h, d)
a = np.arange(24).reshape(*shape,order='F').astype('float32')
print(a.shape,a.strides)
print(a)
descr = drv.ArrayDescriptor3D()
descr.width = w
descr.height = h
descr.depth = d
descr.format = drv.dtype_to_array_format(a.dtype)
descr.num_channels = 1
descr.flags = 0
ary = drv.Array(descr)
copy = drv.Memcpy3D()
copy.set_src_host(a)
copy.set_dst_array(ary)
copy.width_in_bytes = copy.src_pitch = a.strides[1]
copy.src_height = copy.height = h
copy.depth = d
copy()
mod = SourceModule("""
texture<float, 3, cudaReadModeElementType> mtx_tex;
__global__ void copy_texture(float *dest)
{
int x = threadIdx.x;
int y = threadIdx.y;
int z = threadIdx.z;
int dx = blockDim.x;
int dy = blockDim.y;
int i = (z*dy + y)*dx + x;
dest[i] = tex3D(mtx_tex, x, y, z);
}
""")
copy_texture = mod.get_function("copy_texture")
mtx_tex = mod.get_texref("mtx_tex")
mtx_tex.set_array(ary)
dest = np.zeros(shape, dtype=np.float32, order="F")
copy_texture(drv.Out(dest), block=shape, texrefs=[mtx_tex])
print(dest)