Yet another bandwidth related question. I expected the plots of Device-to-host bandwidth and that of Host-to-Device to be similar, but I see that there is a significant difference between the two. Considering both following the same route, so the effective bandwidth should be the same, isn't it? The testbed consists of total 12 Intel Westmere CPUs on two sockets, 4 Tesla C2050 GPUs with 4 PCIe Gen2 Express slots. Using the bandwidthtest program from NVidia code samples.
What are the overheads of doing a cudamemCpy from the host vs the device?