I've got a large chunk of generated data (A[i,j,k]) on the device, but I only need one slice of it, A[i,:,:]. In regular CUDA this could easily be accomplished with some pointer arithmetic.
Can the same thing be done within PyCUDA? For example:
cuda.memcpy_dtoh(h_iA,d_A+(i*stride))
Obviously this is completely wrong, since there's no size information (unless it is inferred from the destination's shape), but hopefully you get the idea.
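To make the idea more concrete, here is a rough sketch of what I have in mind. It assumes A is stored C-contiguously (so A[i,:,:] is one contiguous block), that int(d_A) yields the raw device address of a DeviceAllocation, and that memcpy_dtoh takes the number of bytes to copy from the destination buffer. The helper names slice_offset_bytes and copy_slice_to_host are made up for illustration, and the copy itself is untested.

```python
def slice_offset_bytes(shape, itemsize, i):
    """Byte offset of A[i, :, :] inside a C-contiguous 3D array.

    Hypothetical helper: shape is (ni, nj, nk), itemsize is bytes per element.
    """
    ni, nj, nk = shape
    if not 0 <= i < ni:
        raise IndexError("slice index out of range")
    # Each slice along the first axis holds nj * nk elements.
    return i * nj * nk * itemsize


def copy_slice_to_host(d_A, shape, dtype, i):
    """Copy A[i, :, :] from device to host (needs PyCUDA and a GPU).

    Hypothetical helper: d_A is assumed to be a pycuda DeviceAllocation.
    """
    # Imported lazily so slice_offset_bytes can be used without a GPU.
    import numpy as np
    import pycuda.driver as cuda

    dtype = np.dtype(dtype)
    offset = slice_offset_bytes(shape, dtype.itemsize, i)
    h_iA = np.empty(shape[1:], dtype=dtype)
    # Assumption: the copy size is taken from h_iA's buffer, so only
    # nj * nk elements starting at the offset address are transferred.
    cuda.memcpy_dtoh(h_iA, int(d_A) + offset)
    return h_iA
```

Usage would then be something like h_iA = copy_slice_to_host(d_A, (ni, nj, nk), "float32", i), with the offset expressed in bytes rather than elements.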