I followed the example on this page to get started with CUDA programming. It illustrates different execution configurations by adding two arrays of a million elements each.
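For reference, the code I ran is essentially the article's grid-stride add kernel (reproduced from memory here, so variable names may differ slightly from my exact version):

```cuda
#include <iostream>
#include <cmath>

// Grid-stride loop: each thread processes every `stride`-th element,
// so the kernel is correct for any number of blocks (1 or 4096).
__global__ void add(int n, float *x, float *y)
{
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1 << 20;  // ~1 million elements
  float *x, *y;

  // Unified memory, accessible from both host and device
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;  // 4096 for N = 1<<20
  // The two configurations I timed:
  //   add<<<1, 256>>>(N, x, y);
  add<<<numBlocks, blockSize>>>(N, x, y);
  cudaDeviceSynchronize();

  // Verify: every element should be 1.0 + 2.0 = 3.0
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i] - 3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  cudaFree(x);
  cudaFree(y);
  return 0;
}
```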
I ran the code on a Tesla P100 (Pascal architecture) via Google Colaboratory, whereas the article uses a K80. Here are the `nvprof` timings for the same code on both GPUs:
+---------------------+-------------------------+-----------+
| GPU                 | Execution configuration | Time      |
+---------------------+-------------------------+-----------+
| Tesla K80           | <<<1, 256>>>            | 2.7107 ms |
+---------------------+-------------------------+-----------+
| Tesla P100 (Pascal) | <<<1, 256>>>            | 4.4293 ms |
+---------------------+-------------------------+-----------+
| Tesla K80           | <<<4096, 256>>>         | 94.015 us |
+---------------------+-------------------------+-----------+
| Tesla P100 (Pascal) | <<<4096, 256>>>         | 3.6076 ms |
+---------------------+-------------------------+-----------+
After reading the article, I expected the Pascal architecture to outperform the K80. But the results above show two things:
1. With a single block, the K80 is faster than the P100.
2. Going from 1 block to 4096 blocks yields a large speedup on the K80 (~28x), but barely any on the P100 (~1.2x).
Is this expected? And what would explain observation 2?
Please let me know if I am missing something here.
Thank you for reading.