
I have gone through this site. From there I learned that pinned host memory allocated with cudaMallocHost gives better transfer performance than pageable memory from malloc. Then I wrote two simple programs and compared their execution times:

Using cudaMallocHost:

#include <stdio.h>
#include <time.h>   // clock(), CLOCKS_PER_SEC
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx<N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{
  clock_t start, finish;
  start = clock();
  float *a_h, *a_d;  // Pointer to host & device arrays
  const int N = 100000;  // Number of elements in arrays
  size_t size = N * sizeof(float);
  cudaMallocHost((void **) &a_h, size);
  //a_h = (float *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &a_d, size);   // Allocate array on device
  // Initialize host array and copy it to CUDA device
  for (int i=0; i<N; i++) a_h[i] = (float)i;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  // Do calculation on device:
  int block_size = 4;
  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
  square_array <<< n_blocks, block_size >>> (a_d, N);
  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // Print results
  for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
  // Cleanup
  cudaFreeHost(a_h);
  cudaFree(a_d);
  finish = clock() - start;
  double interval = finish / (double)CLOCKS_PER_SEC;
  printf("%f seconds elapsed\n", interval);
  return 0;
}

Using malloc:

#include <stdio.h>
#include <time.h>   // clock(), CLOCKS_PER_SEC
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx<N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{
  clock_t start, finish;
  start = clock();
  float *a_h, *a_d;  // Pointer to host & device arrays
  const int N = 100000;  // Number of elements in arrays
  size_t size = N * sizeof(float);
  a_h = (float *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &a_d, size);   // Allocate array on device
  // Initialize host array and copy it to CUDA device
  for (int i=0; i<N; i++) a_h[i] = (float)i;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  // Do calculation on device:
  int block_size = 4;
  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
  square_array <<< n_blocks, block_size >>> (a_d, N);
  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // Print results
  for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
  // Cleanup
  free(a_h); cudaFree(a_d);
  finish = clock() - start;
  double interval = finish / (double)CLOCKS_PER_SEC;
  printf("%f seconds elapsed\n", interval);
  return 0;
}

When I ran both programs, the execution times were almost the same. Is there anything wrong in the implementation? What is the exact difference in execution between malloc and cudaMallocHost?

Also, with each run the execution time decreases.

user3545251
  • You'll hardly see a difference for only 100000 elements. You may want to run the "bandwidth test" CUDA sample from http://docs.nvidia.com/cuda/cuda-samples/#bandwidth-test . It runs a more sophisticated test, where the difference should become visible. – Marco13 Apr 17 '14 at 12:44
  • The memory transfers are probably only a small fraction of the time you are measuring. Use a higher-resolution timer and only measure the cudaMemcpy calls, and you might notice a difference. Alternatively, use one of the profiling tools supplied in the CUDA toolkit for your platform. – talonmies Apr 17 '14 at 13:02
  • If I take a large value of N, like N=16*1024*1024, the results do not tally as expected. They are not squares but just the values of "i". Why this anomalous behavior? I have also tried using malloc() instead of cudaMallocHost(). – MuneshSingh Jun 06 '16 at 07:18

2 Answers


If you want to see the difference in execution time for the copy operation, just time the copy operation. In many cases you will see approximately a 2x difference in execution time for just the copy operation when the underlying memory is pinned. And make your copy operation large enough/long enough so that you are well above the granularity of whatever timing mechanism you are using. The various profilers such as the visual profiler and nvprof can help here.
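A minimal sketch of what timing only the host-to-device copy might look like, using CUDA events; the 256 MB buffer size is an arbitrary choice, and replacing cudaMallocHost with plain malloc gives the pageable baseline to compare against:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 256 * 1024 * 1024;     // 256 MB, large enough to measure reliably
    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, size);     // pinned host buffer (swap for malloc to compare)
    cudaMalloc((void **)&d_buf, size);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);  // time only this call
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D copy: %.3f ms (%.2f GB/s)\n", ms, (size / 1.0e9) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}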

The cudaMallocHost operation under the hood is doing something like a malloc plus additional OS functions to "pin" each page associated with the allocation. These additional OS operations take extra time, as compared to just doing a malloc. And note that as the size of the allocation increases, the registration ("pinning") cost will generally increase as well.
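One way to see that separation directly is cudaHostRegister, which applies the same page-locking step to a buffer that was already allocated with plain malloc; a minimal sketch (sizes arbitrary):

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 100000 * sizeof(float);
    float *a_h = (float *)malloc(size);                    // ordinary pageable allocation
    cudaHostRegister(a_h, size, cudaHostRegisterDefault);  // the extra OS work: page-lock ("pin") it
    // ... a_h can now be used like cudaMallocHost memory for transfers ...
    cudaHostUnregister(a_h);                               // undo the pinning
    free(a_h);
    return 0;
}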

Therefore, for many examples, just timing the overall execution doesn't show much difference, because while the cudaMemcpy operation may be quicker from pinned memory, the cudaMallocHost takes longer than the corresponding malloc.

So what's the point?

  1. You may be interested in using pinned memory (i.e. cudaMallocHost) when you will be doing repeated transfers from a single buffer. You only pay the extra cost to pin it once, but you benefit on each transfer/usage.
  2. Pinned memory is required to overlap data transfer operations (cudaMemcpyAsync) with compute activity (kernel calls). Refer to the programming guide; a rough sketch follows this list.
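A rough sketch of point 2, assuming a dummy kernel and arbitrary chunk sizes: with a pinned host buffer and two streams, the copy for one chunk can overlap the kernel running on the other chunk.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

int main(void)
{
    const int N = 1 << 22;                       // 4M elements, processed in 2 chunks
    const int chunk = N / 2;
    size_t bytes = N * sizeof(float);
    float *h, *d;
    cudaMallocHost((void **)&h, bytes);          // pinned: required for the copies to be truly async
    cudaMalloc((void **)&d, bytes);
    for (int i = 0; i < N; i++) h[i] = (float)i;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < 2; c++) {
        int off = c * chunk;
        // the copy for this chunk can overlap with the kernel of the other chunk
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        square_array<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();

    printf("h[100] = %f\n", h[100]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}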
Robert Crovella

I too found that just switching an allocation to cudaHostAlloc / cudaMallocHost doesn't do much by itself. To be sure, run nvprof with --print-gpu-trace and see whether the throughput for memcpyHtoD or memcpyDtoH is good. For PCIe 2.0, you should get around 6-8 GB/s.

However, pinned memory is a prerequisite for cudaMemcpyAsync to actually run asynchronously. After calling cudaMemcpyAsync, I moved whatever computations I had on the host to right after it. In this way you can "layer" the asynchronous memcpys with the host computations (a sketch follows below).

I was surprised that I was able to save quite a lot of time this way; it's worth a try.
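A rough sketch of that pattern, with host_work() standing in for whatever CPU-side computation you have (names and sizes are made up for illustration):

#include <stdio.h>
#include <cuda_runtime.h>

static double host_work(void)
{
    double s = 0.0;
    for (int i = 0; i < 10000000; i++) s += i * 0.5;   // dummy CPU work
    return s;
}

int main(void)
{
    const size_t bytes = 64 * 1024 * 1024;
    float *h, *d;
    cudaMallocHost((void **)&h, bytes);        // pinned, so the async copy does not fall back to synchronous
    cudaMalloc((void **)&d, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);  // returns immediately
    double s = host_work();                    // CPU work overlaps with the transfer
    cudaStreamSynchronize(stream);             // wait for the copy before reusing h

    printf("host work result: %f\n", s);
    cudaStreamDestroy(stream);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}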

lppier