I have an integer matrix of size 100x200x800, stored on the host in a flat vector of 100*200*800 elements, i.e., I have
int* h_data = (int*)malloc(sizeof(int)*100*200*800);
On the device (GPU), I want to pad each dimension with zeros such that I obtain a matrix of size 128x256x1024, allocated as follows:
int *d_data;
cudaMalloc((void**)&d_data, sizeof(int)*128*256*1024);
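Since cudaMalloc does not zero the allocation, I assume the padding first has to be cleared explicitly, e.g.:

```cuda
cudaMemset(d_data, 0, sizeof(int)*128*256*1024);
```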
What is the best approach to obtain the zero-padded matrix? I have two ideas:
- Iterate through the individual submatrices on the host and copy them directly to the correct locations on the device. This approach requires many cudaMemcpy calls and is therefore likely to be very slow.
- On the device, allocate memory for both a 100x200x800 matrix and a 128x256x1024 matrix, and write a kernel that copies each element to its correct position in the padded matrix. This approach is probably much faster, but it requires allocating memory for two matrices on the device.
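For reference, the kernel I have in mind for the second approach looks roughly like this (a sketch only; the kernel name, the launch configuration, and the assumption that the 800 dimension is the contiguous, fastest-varying one are mine):

```cuda
// Sketch: copy a 100x200x800 matrix (d_src) into a zero-initialized
// 128x256x1024 matrix (d_dst). Dimensions are hard-coded for illustration.
__global__ void padCopyKernel(const int* __restrict__ d_src,
                              int* __restrict__ d_dst)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x; // 0..799
    int y = blockIdx.y * blockDim.y + threadIdx.y; // 0..199
    int z = blockIdx.z * blockDim.z + threadIdx.z; // 0..99

    if (x < 800 && y < 200 && z < 100) {
        // Linear index in the unpadded 100x200x800 source...
        size_t srcIdx = (size_t)z * 200 * 800  + (size_t)y * 800  + x;
        // ...and in the padded 128x256x1024 destination.
        size_t dstIdx = (size_t)z * 256 * 1024 + (size_t)y * 1024 + x;
        d_dst[dstIdx] = d_src[srcIdx];
    }
}

// Launch, after cudaMemset(d_dst, 0, ...) has zeroed the padding:
// dim3 block(32, 8, 4);
// dim3 grid((800 + 31) / 32, (200 + 7) / 8, (100 + 3) / 4);
// padCopyKernel<<<grid, block>>>(d_src, d_dst);
```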
Is there any way to do three-dimensional matrix indexing similar to MATLAB's? In MATLAB, I could simply do the following:
h_data = rand(100, 200, 800);
d_data = zeros(128, 256, 1024);
d_data(1:100, 1:200, 1:800) = h_data;
Alternatively, if I copy the data to the device using

cudaMemcpy(d_data, h_data, sizeof(int)*100*200*800, cudaMemcpyHostToDevice);

is it possible to reorder the data in place, so that I do not have to allocate memory for a second matrix? Perhaps cudaMemcpy3D or cudaMemset3D could help here?
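For example, from my reading of the cudaMemcpy3D documentation, something along these lines might copy the contiguous host buffer directly into the padded device buffer with a single strided transfer, avoiding the second device matrix entirely (untested; I may be misreading the pitch/extent semantics, and I again assume the 800 dimension is contiguous):

```cuda
// Zero the whole padded destination first, so the pad regions are 0.
cudaMemset(d_data, 0, sizeof(int) * 128 * 256 * 1024);

cudaMemcpy3DParms p = {0};
// Source: tightly packed 100x200x800 host buffer.
p.srcPtr = make_cudaPitchedPtr(h_data,
                               800 * sizeof(int), // row pitch in bytes
                               800,               // row width in elements
                               200);              // rows per slice
// Destination: padded 128x256x1024 device buffer.
p.dstPtr = make_cudaPitchedPtr(d_data,
                               1024 * sizeof(int),
                               1024,
                               256);
// Copy only the 800x200x100 region (extent width is in bytes here).
p.extent = make_cudaExtent(800 * sizeof(int), 200, 100);
p.kind   = cudaMemcpyHostToDevice;
cudaMemcpy3D(&p);
```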