
Why would the code from the PyCUDA KernelConcurrency example not run faster in 'concurrent' mode? It seems like there should be enough resources on my GPU, so what am I missing?

Here is the output from the 'concurrent' version, with line 63 uncommented:

=== Device attributes
Name: GeForce GTX 980
Compute capability: (5, 2)
Concurrent Kernels: True

=== Checking answers
Dataset 0 : passed.
Dataset 1 : passed.

=== Timing info (for last set of kernel launches)
Dataset 0
kernel_begin : 1.68524801731
kernel_end : 1.77305603027
Dataset 1
kernel_begin : 1.7144639492
kernel_end : 1.80246400833

Here is the output with line 63 commented out. This should no longer be running concurrently and should be significantly slower, but it looks nearly the same to me (about 0.08–0.09 in both cases):

=== Device attributes
Name: GeForce GTX 980
Compute capability: (5, 2)
Concurrent Kernels: True

=== Checking answers
Dataset 0 : passed.
Dataset 1 : passed.

=== Timing info (for last set of kernel launches)
Dataset 0
kernel_begin : 1.20230400562
kernel_end : 1.28966403008
Dataset 1
kernel_begin : 1.21827197075
kernel_end : 1.30672001839

Is there something I'm missing here? Is there another way to test concurrency?

Alex Hall
  • I don't understand what you are trying to ask here. What does the first paragraph have to do with the rest of the question? Are you just really asking about that PyCUDA example? If so, what is it you want to know? Why should a non-concurrent version be significantly slower? Do you know whether concurrent execution is *ever* occurring? Have you looked at any profiler traces? – talonmies May 03 '16 at 05:13
  • @DmitriBudnikov - I am running the code exactly as written in the link. – Alex Hall May 03 '16 at 05:35
  • @talonmies - First part is not really related except to give context as to why I'm even working on this. My CUDA based histogram code runs significantly slower than OpenCV's calcHist() function running on the CPU. In an effort to figure out why, I was trying to check if my code is not running concurrently. I ran the sample code linked above; I expected the first run (the concurrent one) to be faster than the second one (nonconcurrent), but it's not. Why would that happen? Should I just post a question directly addressing my kernel? – Alex Hall May 03 '16 at 05:39
  • If it isn't related, then it shouldn't be in the question, so perhaps you should remove it for clarity. You do understand what "concurrent" means in this context? It means multiple kernels being launched and run using the streams API, and *potentially* running concurrently on the GPU, if resources allow. That doesn't automatically imply any performance advantage. – talonmies May 03 '16 at 05:56
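
A minimal sketch of the streams pattern talonmies describes (an illustration only, with a trivial made-up kernel; not the wiki code itself):

    import numpy as np
    import pycuda.autoinit          # creates a CUDA context on the default device
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    # Trivial busy-work kernel, purely for illustration.
    mod = SourceModule("""
    __global__ void my_kernel(float *d)
    {
        for (int i = 0; i < 1000; i++)
            d[threadIdx.x] = sqrtf(d[threadIdx.x] + 1.0f);
    }
    """)
    my_kernel = mod.get_function("my_kernel")

    n, N = 2, 32                    # two streams, 32 threads per block
    stream = [drv.Stream() for _ in range(n)]
    d_data = [drv.mem_alloc(N * np.float32().nbytes) for _ in range(n)]
    for d in d_data:
        drv.memcpy_htod(d, np.ones(N, dtype=np.float32))

    # Each launch is issued to its own stream; the GPU *may* overlap them
    # if resources allow -- that is all "concurrent kernels" promises.
    for k in range(n):
        my_kernel(d_data[k], block=(N, 1, 1), stream=stream[k])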

1 Answer


The only way to truly see what is happening with concurrent kernel execution is to profile the code.
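
(For reference: a timeline like the ones below can be captured with the CUDA toolkit profilers, e.g. the Visual Profiler, nvvp, or from the command line with something like `nvprof --print-gpu-trace python your_script.py`, where `your_script.py` stands in for the wiki example script.)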

With the inner kernel launch loop as posted on the wiki:

# Run kernels many times, we will only keep data from last loop iteration.
for j in range(10):
    for k in range(n):
        event[k]['kernel_begin'].record(stream[k])
        my_kernel(d_data[k], block=(N,1,1), stream=stream[k]) 
    for k in range(n): # Commenting out this line should break concurrency.
        event[k]['kernel_end'].record(stream[k])

the profile trace looks like this:

[profiler timeline: the kernel launches in the two streams overlap]

With the inner kernel launch loop like this (i.e. the kernel end events are not pushed onto the stream within their own loop):

# Run kernels many times, we will only keep data from last loop iteration.
for j in range(10):
    for k in range(n):
        event[k]['kernel_begin'].record(stream[k])
        my_kernel(d_data[k], block=(N,1,1), stream=stream[k]) 
#    for k in range(n): # Commenting out this line should break concurrency.
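        # NB: with the loop header above commented out, the record call below
        # simply executes inside the launch loop, so each stream still gets
        # begin-record / launch / end-record issued in order, and nothing
        # forces the two streams to serialise.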
        event[k]['kernel_end'].record(stream[k])

I get this profile:

[profiler timeline: the kernels in the two streams still overlap]

i.e. the kernels in the two execution streams are still overlapping.

So the reason the execution time doesn't change between the two examples is that the comment you are relying on is erroneous. Both cases yield kernel execution overlap ("concurrency").

I have no interest in understanding why that is the case, but that is the source of your confusion. You will need to look elsewhere for the source of poor performance in your code (which apparently doesn't use streams anyway, so this entire question was a straw man).
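
If you want a genuinely serialised baseline to compare against, a minimal sketch (my construction, not part of the wiki example) is to block on each stream before issuing the next launch:

    # Hypothetical variant: synchronising after each launch forces the
    # kernels to run back-to-back, so no overlap is possible.
    for j in range(10):
        for k in range(n):
            event[k]['kernel_begin'].record(stream[k])
            my_kernel(d_data[k], block=(N, 1, 1), stream=stream[k])
            event[k]['kernel_end'].record(stream[k])
            stream[k].synchronize()  # wait for this stream's work to finish

With that change the launches cannot overlap, so the timing gap between the concurrent and serialised versions should become visible.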

talonmies
  • Thank you for the detailed explanation. It is helpful to know that the comment was erroneous. I assume you're looking at my other question for the claim that my code was not using streams. You're correct that the code I posted in the other question does not use streams. An earlier version of the code was attempting to use streams. They seemed to not be making a difference. To ensure that I was using and understanding streams properly, I was trying to set up that sample code. When it wasn't working as described, I asked for help. I will improve my future Stack Overflow requests. – Alex Hall May 07 '16 at 02:09