Performance Analysis of Multiple Kernels (CUDA C)

Question

I have a CUDA program with multiple kernels that run in series (in the same stream, the default one). I want to analyze the performance of the program as a whole, specifically the GPU portion. I'm collecting metrics such as achieved_occupancy, inst_per_warp, and gld_efficiency with the nvprof tool.

But the profiler reports metric values separately for each kernel, while I want a single value across all of them to see the total GPU usage of the program. For each metric, should I take the average, the largest value, or the total over all kernels?

I would use a weighted average, where the weighting factor is the kernel execution time over the sum of all kernel execution times. — Robert Crovella, Nov 06 '18 at 23:03
Thank you for your reply, I needed that a lot. As I understand it, if I have 3 kernels and the achieved occupancy for each one separately, then to compute the overall achieved occupancy: 1- I first compute the weighting factor for each kernel. 2- Then I multiply that factor by the achieved occupancy of each kernel? Sorry about the confusion, but how do I then compute the overall achieved occupancy? — Sarah Hamed, Nov 07 '18 at 13:50

Robert Crovella · Answer 1 · 2018-11-07T17:44:19.827

One possible approach would be to use a weighted average method.

Suppose we had 3 non-overlapping kernels in our timeline. Let's say kernel 1 runs for 10 milliseconds, kernel 2 runs for 20 milliseconds, and kernel 3 runs for 30 milliseconds. Collectively, the 3 kernels occupy 60 milliseconds of our overall application timeline.

Let's also suppose that the profiler reports the gld_efficiency metric as follows:

kernel     duration    gld_efficiency
     1        10ms               88%
     2        20ms               76%
     3        30ms               50%

You could compute the weighted average as follows:

                                     88*10        76*20        50*30
"overall"  global load efficiency =  -----   +    -----    +   ----- = 65%
                                       60           60           60
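
The same arithmetic can be scripted once the per-kernel numbers have been copied out of the profiler output. This is a minimal sketch, assuming the durations and metric values are typed in by hand; `weighted_average` is a hypothetical helper name, not part of nvprof or any CUDA tooling.

```python
# Per-kernel numbers taken from the example table above.
durations_ms = [10.0, 20.0, 30.0]
gld_efficiency_pct = [88.0, 76.0, 50.0]


def weighted_average(values, weights):
    """Return sum(v * w) / sum(w). Illustrative helper, not a profiler API."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Each kernel's weight is its duration; dividing by the 60 ms total
# happens inside weighted_average.
overall = weighted_average(gld_efficiency_pct, durations_ms)
print(f"overall gld_efficiency = {overall:.1f}%")  # prints 65.0%
```

The same helper would apply to any other percentage-style metric (for example achieved_occupancy) by swapping in that metric's per-kernel values.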

I'm sure there are other approaches that make sense as well. For example, a better approach might be to have the profiler report the total number of global load transactions for each kernel, and do your weighting based on that rather than on kernel duration:

kernel     gld_transactions    gld_efficiency
     1        1000               88%
     2        2000               76%
     3        3000               50%


                                     88*1000        76*2000        50*3000
"overall"  global load efficiency =  -------   +    -------    +   ------- = 65%
                                       6000           6000           6000
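
As a rough sketch of the transaction-weighted variant, the snippet below uses the values from the table above, entered by hand. The tuple layout and variable names are illustrative assumptions, not something the profiler emits.

```python
# (kernel id, gld_transactions, gld_efficiency in %) from the table above.
kernels = [
    (1, 1000, 88.0),
    (2, 2000, 76.0),
    (3, 3000, 50.0),
]

total_transactions = sum(count for _, count, _ in kernels)

# Each kernel's weight is its share of all global load transactions.
overall_by_transactions = sum(
    eff * count / total_transactions for _, count, eff in kernels
)

print(total_transactions)
print(f"{overall_by_transactions:.1f}%")
```

In this example the result matches the duration-weighted figure only because the transaction counts (1000 : 2000 : 3000) are in the same ratio as the durations (10 : 20 : 30). With real profiler data the two weightings will generally give different numbers.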

Thank you, that's great, I understand it well now. Actually, I have the following metrics to measure: achieved_occupancy, warp_execution_efficiency, gld_efficiency, gld_transactions, gld_throughput, and the same for gst. If I use the second approach, how would I measure that for gld_transactions itself? So I think the better way is to use the first approach for all metrics. Is that a good decision? — Sarah Hamed, Nov 07 '18 at 17:52
For `gld_transactions`, you have to decide what you want to report. There is not just one possible answer here. For example, if you want to report the total global load transactions consumed by your application, then you would just add together all the numbers reported by the profiler. But if you wanted to report the average global load transactions per kernel, you would add all the global load transactions together and divide by the number of kernel calls. It will probably vary for each metric that you want a summary report for. I'm not going to try to spell out an answer for each. — Robert Crovella, Nov 07 '18 at 20:59
Thank you for your answer, it's helpful. Is there any reference you would recommend on how to interpret and work with these metrics? — Sarah Hamed, Nov 08 '18 at 14:27
