
I have an integer matrix of size 100x200x800 which is stored on the host in a flat 100*200*800 vector, i.e., I have

int* h_data = (int*)malloc(sizeof(int)*100*200*800);

On the device (GPU), I want to pad each dimension with zeros such that I obtain a matrix of size 128x256x1024, allocated as follows:

int *d_data;
cudaMalloc((void**)&d_data, sizeof(int)*128*256*1024);

What is the best approach to obtain the zero-padded matrix? I have two ideas:

  1. Iterate through the individual submatrices on the host and copy each one directly to its correct location on the device.
    • This approach requires many cudaMemcpy calls and is thus likely to be very slow.
  2. On the device, allocate memory for both a 100x200x800 matrix and a 128x256x1024 matrix, and write a kernel that copies the samples to the correct locations in the padded matrix (see the rough sketch after this list).
    • This approach is probably much faster but requires allocating memory for two matrices on the device.
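
For reference, here is a rough, untested sketch of what I have in mind for the second approach (the kernel name, the d_src/d_padded pointers and the launch configuration are just placeholders):

__global__ void pad_copy(const int* __restrict__ src, int* __restrict__ dst,
                         int sw, int sh, int sd,   // source dimensions (100, 200, 800)
                         int dw, int dh)           // padded width and height (128, 256)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < sw && y < sh && z < sd) {
        // copy one element from the dense source into the zero-initialized padded array
        dst[((size_t)z * dh + y) * dw + x] = src[((size_t)z * sh + y) * sw + x];
    }
}

// one thread per source element, e.g.:
// dim3 block(32, 8, 4);
// dim3 grid((100 + 31) / 32, (200 + 7) / 8, (800 + 3) / 4);
// pad_copy<<<grid, block>>>(d_src, d_padded, 100, 200, 800, 128, 256);

The padded array would have to be zeroed (e.g. with cudaMemset) before the kernel runs, since the kernel only writes the interior elements.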

Is there any possibility for three-dimensional matrix indexing similar to MATLAB? In MATLAB, I could simply do the following:

h_data = rand(100, 200, 800);
d_data = zeros(128, 256, 1024);
d_data(1:100, 1:200, 1:800) = h_data;

Alternatively, if I copy the data to the device using cudaMemcpy(d_data, h_data, sizeof(int)*100*200*800, cudaMemcpyHostToDevice), is it possible to reorder the data in place so that I do not have to allocate memory for a second matrix, perhaps using cudaMemcpy3D or cudaMemset3D?

brnk
  • The answer to the first question totally depends on the planned usage on the GPU: how will you address the array (sparsely, linearly, etc.), how many times will you read and write, etc. – Damir Tenishev Oct 12 '21 at 22:11
  • I want to perform a FFT on the zero-padded matrix using the cuFFT library and copy the data back to the host. This should be performed continuously on a stream, i.e., as soon as the FFT is computed, the next matrix should be processed. I'm new to CUDA and I couldn't find any information about sparse or linear memory. The matrix itself is not sparse. I hope this information helps. – brnk Oct 13 '21 at 12:27
  • The second approach would be preferred for most factors of consideration that I can think of. The extra allocation here is 64MB in size. Is that a problem? No, you cannot use MATLAB-style matrix slicing notation to refer to or populate matrices in C++, which is what CUDA is based on. – Robert Crovella Oct 13 '21 at 15:14

1 Answer


As you hypothesize, you can use cudaMemcpy3D for this operation. Basically:

  1. Allocate your device array as normal
  2. Zero it with cudaMemset
  3. Use cudaMemcpy3D to perform a linear memory copy of the 100x200x800 subarray from the host source to the device destination array.

The cudaMemcpy3D API is a bit baroque, cryptically documented, and has a few common traps for beginners. Basically, linear memory transfers require a pitched pointer for both the source and destination, and an extent denoting the size of the transfer. The confusing part is that the argument meanings change depending on whether the source and/or destination memory is a CUDA array or pitched linear memory. In code you will want something like this:

// host array: 100 x 200 x 800, densely packed
int hw = 100, hh = 200, hd = 800;
size_t hpitch = hw * sizeof(int);
int* h_data = (int*)malloc(hpitch * hh * hd);

// device array: 128 x 256 x 1024, zeroed so the padding is already in place
int dw = 128, dh = 256, dd = 1024;
size_t dpitch = dw * sizeof(int);
int *d_data;
cudaMalloc((void**)&d_data, dpitch * dh * dd);
cudaMemset(d_data, 0, dpitch * dh * dd);

// pitched pointers describing both allocations (pitch in bytes, logical sizes in elements)
cudaPitchedPtr src = make_cudaPitchedPtr(h_data, hpitch, hw, hh);
cudaPitchedPtr dst = make_cudaPitchedPtr(d_data, dpitch, dw, dh);

// extent of the transfer: width in bytes (linear memory), height in rows, depth in slices
cudaExtent copyext = make_cudaExtent(hpitch, hh, hd);

cudaMemcpy3DParms copyparms = {0};
copyparms.srcPtr = src;
copyparms.dstPtr = dst;
copyparms.extent = copyext;
copyparms.kind = cudaMemcpyHostToDevice;

cudaMemcpy3D(&copyparms);
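
As an aside, cudaMemcpy3D (like the rest of the runtime API) returns a cudaError_t, so in real code you will probably want to check the status of the copy, something like:

cudaError_t status = cudaMemcpy3D(&copyparms);
if (status != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy3D failed: %s\n", cudaGetErrorString(status));  // needs <cstdio>
}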

[Note: all done in the browser, never compiled or run; use at your own risk]

talonmies
  • Thank you so much. This was exactly what I was looking for. You could edit your answer and change `copyparms.desPtr` to `copyparms.dstPtr`. Everything else was working immediately. – brnk Oct 14 '21 at 14:20