Is cudaMemcpy executed on the CPU or on the GPU when copying from host to device?
In other words, is the copy a sequential process, or is it done in parallel?
Let me explain why I ask: I have an array of 5 million elements, and I want to copy 2 sets of 50,000 elements each from different parts of that array to the device. So I was wondering: would it be faster to first pack all the elements I want into one contiguous array on the CPU and then do a single large transfer, or should I just call cudaMemcpy twice, once for each set?
If cudaMemcpy is done in parallel, then I think the second approach will be faster, since I don't have to copy 100,000 elements sequentially on the CPU first.
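To make the two options concrete, here is a rough sketch of what I am comparing. The element type (float), the two offsets, and the names (kOffA, kOffB, staging, CHECK) are made up for illustration; only the sizes come from my actual problem. It does not measure anything, it just shows the two call patterns.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Sizes from the question; offsets are hypothetical, chosen only for illustration.
constexpr size_t kTotal = 5000000;  // full host array
constexpr size_t kChunk = 50000;    // elements per set
constexpr size_t kOffA  = 1000000;  // assumed start of the first set
constexpr size_t kOffB  = 3000000;  // assumed start of the second set

// Minimal error check: print the CUDA error string and bail out of main.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s\n", cudaGetErrorString(err_));   \
            return 1;                                            \
        }                                                        \
    } while (0)

int main() {
    // Host array of 5 million elements.
    std::vector<float> h(kTotal);
    for (size_t i = 0; i < kTotal; ++i) h[i] = static_cast<float>(i);

    // Device buffer large enough for both sets (100,000 elements).
    float* d = nullptr;
    CHECK(cudaMalloc((void**)&d, 2 * kChunk * sizeof(float)));

    // Approach 1: pack both sets into one contiguous host buffer on the CPU,
    // then issue a single host-to-device transfer.
    std::vector<float> staging(2 * kChunk);
    std::memcpy(staging.data(), h.data() + kOffA, kChunk * sizeof(float));
    std::memcpy(staging.data() + kChunk, h.data() + kOffB, kChunk * sizeof(float));
    CHECK(cudaMemcpy(d, staging.data(), 2 * kChunk * sizeof(float),
                     cudaMemcpyHostToDevice));

    // Approach 2: skip the CPU-side packing and issue two transfers,
    // one per set, straight from the original array.
    CHECK(cudaMemcpy(d, h.data() + kOffA, kChunk * sizeof(float),
                     cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d + kChunk, h.data() + kOffB, kChunk * sizeof(float),
                     cudaMemcpyHostToDevice));

    CHECK(cudaFree(d));
    return 0;
}
```

Both approaches leave the same 100,000 elements in the device buffer; the only difference is the extra CPU-side memcpy in the first versus the extra cudaMemcpy call in the second.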