
There are a few different forms of task parallelism that can be exploited with CUDA. For example, we can overlap memory copies with kernel execution; in that case the host memory must be allocated as pinned memory using cudaHostAlloc, and streams are used to run the operations in parallel. But if I am only interested in running a few kernels concurrently with each other using streams, do I have to use pinned memory, or can I use ordinary unpinned memory (that is, memory from malloc)?

Thank you,

shadow

1 Answer


As long as you invoke kernels in separate streams, CUDA will try to run those kernels in parallel.
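A minimal sketch of this, assuming the kernel name `scale` and the buffer names are illustrative: the kernels are launched into two different streams and operate only on device memory from cudaMalloc, so the host staging buffer can be ordinary pageable memory from malloc. Note that cudaMemcpyAsync from a pageable buffer silently degrades to a synchronous copy, which is why pinned memory matters for copy/compute overlap but not for kernel concurrency itself.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial illustrative kernel: scales each element by s.
__global__ void scale(float *d, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));  // ordinary pageable host memory
    float *dA, *dB;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Copies from pageable memory are fine, but they are synchronous
    // with respect to the host even if issued with cudaMemcpyAsync.
    cudaMemcpy(dA, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Kernels launched into different streams may execute concurrently;
    // they reference only cudaMalloc'd device memory, so no pinning is needed.
    scale<<<(n + 255) / 256, 256, 0, s0>>>(dA, 2.0f, n);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(dB, 3.0f, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(dA); cudaFree(dB); free(h);
    return 0;
}
```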

Kernels can only reference memory for which there is a device memory address; the only way you can get a device memory address for host memory is by allocating it as mapped pinned memory. This happens automatically if UVA is enabled; otherwise you have to call cudaSetDeviceFlags() with cudaDeviceMapHost, and call cudaHostAlloc() with the cudaHostAllocMapped flag.
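For the mapped-pinned case, the allocation might look like the following sketch (buffer names and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    // On platforms without UVA, this flag must be set before the
    // CUDA context is created (i.e., before any other CUDA call).
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *h_buf, *d_alias;
    cudaHostAlloc(&h_buf, 1024 * sizeof(float), cudaHostAllocMapped);

    // Obtain the device address that aliases the pinned host allocation.
    // With UVA enabled, h_buf itself can be passed to a kernel directly.
    cudaHostGetDevicePointer(&d_alias, h_buf, 0);

    // d_alias (or h_buf under UVA) can now be dereferenced by kernels.
    cudaFreeHost(h_buf);
    return 0;
}
```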

So if the goal is for concurrently running kernels to reference host memory, then the answer to your question is yes: you must use pinned memory, and that memory must be mapped. But for kernels that reference only device memory allocated with cudaMalloc, no pinned memory is required.

ArchaeaSoftware