
I followed the example on this page to get started with CUDA programming. It adds two arrays of a million elements each and illustrates the effect of different execution configurations.
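
For reference, here is a minimal sketch of the kind of code being timed, modelled on the vector-add example in that article (my actual run may differ in small details):

```
#include <iostream>
#include <cmath>

// Grid-stride loop: each thread handles elements spaced blockDim.x * gridDim.x apart
__global__ void add(int n, float *x, float *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1 << 20;  // ~1 million elements
    float *x, *y;

    // Unified (managed) memory: one pointer usable from both host and device
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // Execution configuration, e.g. <<<1, 256>>> or <<<4096, 256>>> as in the table below
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;  // 4096 blocks for N = 2^20
    add<<<numBlocks, blockSize>>>(N, x, y);

    cudaDeviceSynchronize();

    // Quick correctness check: every element of y should now be 3.0f
    float maxError = 0.0f;
    for (int i = 0; i < N; i++) maxError = fmaxf(maxError, fabsf(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```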

I ran the code on a Tesla P100 (Pascal architecture) using Google Colaboratory, whereas the article uses a K80. Here are the nvprof metrics from executing the same code on both GPUs.

| GPU                  | Execution configuration | Time      |
| -------------------- | ------------------------ | --------- |
| K80                  | `<<<1, 256>>>`            | 2.7107 ms |
| Tesla P100 (Pascal)  | `<<<1, 256>>>`            | 4.4293 ms |
| K80                  | `<<<4096, 256>>>`         | 94.015 µs |
| Tesla P100 (Pascal)  | `<<<4096, 256>>>`         | 3.6076 ms |

After reading this article, I expected the Pascal architecture to outperform the K80. However, the results above lead to two observations:

  1. The K80 is faster than the P100 for single-block execution.
  2. Going from 1 block to 4096 blocks yields a significant speedup on the K80 (~28x), but not on the P100 (~1.2x).

Is this expected? And what would explain observation (2)?

Please let me know if I am missing something here.

Thank you for reading.

  • The code uses managed memory (unified memory). The behavior of managed memory is substantially different between K80 and P100. In the K80 case, none of the managed memory overhead associated with data movement for the kernel launch is showing up in the kernel duration measurement. In the P100 case, all of the overhead for data movement to the device is showing up in kernel duration. Your question is essentially a duplicate of [this one](https://stackoverflow.com/questions/39782746/why-is-nvidia-pascal-gpus-slow-on-running-cuda-kernels-when-using-cudamallocmana) – Robert Crovella Feb 27 '20 at 17:56
  • @Robert Crovella thank you for pointing that out. At the time of posting, I was not aware of the overhead. Correct me if I'm wrong, but I now understand that there is some degree of coupling between GPU H/W architecture and CUDA enhancements (the article says Pascal better supports Unified memory). Would it be possible to run an older version of CUDA that did not implement unified memory to see different performance characteristics? – Rajesh Shashi Kumar Feb 27 '20 at 18:10
  • The older version of CUDA in that case won't have direct support for Pascal. And you'd have to rewrite the code. If you want to eliminate the unified memory effect, the code can be rewritten to use typical CUDA with host `malloc` and device `cudaMalloc`, rather than `cudaMallocManaged`, and the discrepancies will go away. You don't need to switch to a different CUDA version. You might wish to study CUDA managed memory to learn how it differs from non-managed memory usage. – Robert Crovella Feb 27 '20 at 18:16
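
Following the suggestion in the last comment above, a rough sketch of the non-managed variant (host `malloc` plus `cudaMalloc`/`cudaMemcpy` instead of `cudaMallocManaged`) could look like this; it is only an illustration, not the article's code:

```
#include <cstdio>
#include <cstdlib>

__global__ void add(int n, float *x, float *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1 << 20;

    // Host allocations with plain malloc
    float *h_x = (float *)malloc(N * sizeof(float));
    float *h_y = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    // Device allocations with cudaMalloc
    float *d_x, *d_y;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));

    // Explicit copies to the device happen before the kernel launch,
    // so nvprof's kernel duration no longer includes any data movement
    cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, N * sizeof(float), cudaMemcpyHostToDevice);

    add<<<4096, 256>>>(N, d_x, d_y);
    cudaDeviceSynchronize();

    // Copy the result back and release everything
    cudaMemcpy(h_y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expected 3.0)\n", h_y[0]);

    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}
```

As noted in the comment, with the managed-memory migration cost removed from the measured kernel duration, the discrepancy between the two GPUs should go away.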

0 Answers