There are a few different forms of task parallelism that can be exploited with CUDA. For example, we can overlap copying memory back and forth with kernel execution. In that case the host memory has to be allocated as pinned memory using cudaHostAlloc, and streams are used to run the copies and the kernels concurrently. But if I am only interested in running a few kernels in parallel with each other using streams, do I have to use pinned memory, or can I use normal unpinned host memory (i.e. allocated with malloc)?
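To make the question concrete, here is a minimal sketch of the scenario I have in mind. The kernel scaleKernel, the array size N, and the two streams are just placeholders for my real code; the point is that the host buffers come from plain malloc rather than cudaHostAlloc, the copies are ordinary synchronous cudaMemcpy calls, and only the kernel launches go into separate streams:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Placeholder kernel standing in for my real workload.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Ordinary pageable host memory (malloc), NOT cudaHostAlloc.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    // Synchronous copies from the pageable host buffers.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // The part I am asking about: two kernels launched into different
    // non-default streams so that they can run concurrently on the device.
    scaleKernel<<<(N + 255) / 256, 256, 0, s1>>>(d_a, N, 2.0f);
    scaleKernel<<<(N + 255) / 256, 256, 0, s2>>>(d_b, N, 0.5f);

    cudaDeviceSynchronize();

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, bytes, cudaMemcpyDeviceToHost);
    printf("h_a[0] = %f, h_b[0] = %f\n", h_a[0], h_b[0]);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    free(h_a);
    free(h_b);
    return 0;
}
```

Is a setup like this valid for concurrent kernel execution, or does the host memory feeding d_a and d_b still need to be pinned?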
Thank you,