Why does the code from the PyCuda KernelConcurrency Example not run faster in 'concurrent' mode? It seems like there should be enough resources on my GPU... what am I missing?
Here is the output from the 'concurrent' version, with line 63 uncommented:
=== Device attributes
Name: GeForce GTX 980
Compute capability: (5, 2)
Concurrent Kernels: True
=== Checking answers
Dataset 0 : passed.
Dataset 1 : passed.
=== Timing info (for last set of kernel launches)
Dataset 0
kernel_begin : 1.68524801731
kernel_end : 1.77305603027
Dataset 1
kernel_begin : 1.7144639492
kernel_end : 1.80246400833
Here is the output with line 63 commented out. This should no longer be running concurrently and should be significantly slower, yet it looks nearly the same to me (each kernel takes about 0.08 - 0.09 s in both cases):
=== Device attributes
Name: GeForce GTX 980
Compute capability: (5, 2)
Concurrent Kernels: True
=== Checking answers
Dataset 0 : passed.
Dataset 1 : passed.
=== Timing info (for last set of kernel launches)
Dataset 0
kernel_begin : 1.20230400562
kernel_end : 1.28966403008
Dataset 1
kernel_begin : 1.21827197075
kernel_end : 1.30672001839
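To make the comparison concrete, here is a quick sanity check that recomputes the per-kernel durations from the event timestamps in the two logs above (the numbers are copied verbatim from the output):

```python
# Per-kernel durations recomputed from the kernel_begin/kernel_end
# timestamps printed by the example, for both runs.
runs = {
    "concurrent (line 63 uncommented)": [
        (1.68524801731, 1.77305603027),  # Dataset 0
        (1.7144639492, 1.80246400833),   # Dataset 1
    ],
    "serial (line 63 commented out)": [
        (1.20230400562, 1.28966403008),  # Dataset 0
        (1.21827197075, 1.30672001839),  # Dataset 1
    ],
}
for name, events in runs.items():
    for i, (begin, end) in enumerate(events):
        print(f"{name}, Dataset {i}: {end - begin:.4f} s")
```

In both runs every kernel comes out at roughly 0.088 s, which is exactly what prompts the question.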
Is there something I'm missing here? Is there another way to test for concurrency?