
While profiling an algorithm running on the GPU, I feel that I'm hitting the memory bandwidth limit.

I have several complex kernels performing complicated operations (sparse matrix multiplications, reductions, etc.) and some very simple ones, and it seems that all the significant ones hit a ~79GB/s bandwidth wall when I calculate the total data read/written for each of them, regardless of their complexity, while the theoretical GPU bandwidth is 112GB/s (Nvidia GTX 960).

The data set is very large, operating on vectors of ~10,000,000 float entries, so I get good measurements/statistics from clGetEventProfilingInfo between COMMAND_START and COMMAND_END. All the data remains in GPU memory during the algorithm's run, so there is virtually no host/device memory transfer (and it is not measured by the profiling counters anyway).

Even for a very simple kernel (see below) that solves x = x + alpha*b, where x and b are huge vectors of ~10,000,000 entries, I don't get close to the theoretical bandwidth (112GB/s), but rather run at ~70% of the maximum (~79GB/s).

__kernel void add_vectors(int N, __global float *x, __global float const *b, float factor)
{
    int gid = get_global_id(0);
    if (gid < N)
        x[gid] += b[gid] * factor;
}

I calculate the data transferred for this particular kernel per run as N * (2 + 1) * 4 bytes (a short timing sketch follows the list):

  • N - size of vector = ~10,000,000
  • 2 loads and 1 store per vector entry
  • 4 for sizeof float
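
For reference, a minimal sketch of how those numbers translate into a bandwidth figure from the profiling timestamps (assuming `evt` is the kernel's event and the queue was created with CL_QUEUE_PROFILING_ENABLE):

cl_ulong t_start, t_end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);

double seconds = (t_end - t_start) * 1e-9;            /* timestamps are in nanoseconds */
double bytes   = (double)N * (2 + 1) * sizeof(float); /* 2 loads + 1 store per entry   */
double gbps    = bytes / seconds / 1e9;               /* comes out at ~79GB/s          */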

I expected that for such a simple kernel I would get much closer to the bandwidth limit. What am I missing?

P.S.: I get similar numbers from a CUDA implementation of the same algorithm.

Artyom

1 Answer


I think a more realistic way to evaluate whether you have reached peak bandwidth is to compare what you get with a simple device-to-device (D2D) copy.
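
A minimal OpenCL sketch of such a baseline, assuming `queue` was created with CL_QUEUE_PROFILING_ENABLE and reusing the same `b`, `x`, and `N` from the question (the event is timed the same way as the kernel launches):

/* enqueue a plain device-to-device copy of the same amount of data and
 * profile it with CL_PROFILING_COMMAND_START/END as usual */
cl_event copy_evt;
clEnqueueCopyBuffer(queue, b, x, 0, 0, (size_t)N * sizeof(float), 0, NULL, &copy_evt);
clWaitForEvents(1, &copy_evt);
/* the copy reads and writes N*sizeof(float) bytes each (2N*sizeof(float) total),
 * while the add_vectors kernel moves 3N*sizeof(float) - hence the 1.5x comparison below */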

For example, your kernel reads x and b once and writes x once, so the upper limit on its execution time should be about 1.5x the time of copying from b to x once. If you find the time is much higher than that, you probably still have room to improve. In this kernel the work per element is so simple that the overhead (launching and ending the kernel, computing the index, etc.) may limit the performance. If this is an issue, you may find that increasing the work per thread with a grid-stride loop helps, as sketched after the link below.

https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
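
For illustration, a grid-stride version of the add_vectors kernel from the question might look like the following sketch (the kernel name add_vectors_strided is just a placeholder):

__kernel void add_vectors_strided(int N, __global float *x, __global float const *b, float factor)
{
    /* each work-item processes several elements, striding by the total
       number of work-items, so the launch size can be chosen to saturate
       the device instead of being tied to the vector length */
    int stride = (int)get_global_size(0);
    for (int i = (int)get_global_id(0); i < N; i += stride)
        x[i] += b[i] * factor;
}

With this you can launch a fixed number of work-groups per compute unit and let each work-item loop over the rest of the vector.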

As for the theoretical bandwidth, you should at least account for the overhead of ECC if it is enabled.

kangshiyin
  • A simple D2D copy is already packaged up for you in the CUDA sample code `bandwidthTest`. Just compile and run that sample code, and the reported device-to-device bandwidth is a reasonable proxy measurement for the maximum available memory bandwidth on your GPU. – Robert Crovella Jun 09 '16 at 14:55
  • OK, in both CUDA and OpenCL, copying memory via cudaMemcpy/clEnqueueCopyBuffer or using a simple memory-copy kernel transfers data at 78GB/s, and Nvidia's bandwidthTest sample also gives 78GB/s, i.e. it works at ~70% of the reported bandwidth. Also, the GTX 960 does not have the ECC feature (at least nvidia-smi reports N/A for its control). It is quite sad, but at least I got the right measurement. – Artyom Jun 09 '16 at 18:32
  • @Artyom It is the same as on a Tesla K40c: ~70% with ECC off, ~64% with ECC on. – kangshiyin Jun 10 '16 at 11:14
  • Thanks... you have given me valuable input. – Artyom Jun 10 '16 at 13:50