
Having stumbled over this forum thread, dot product faster on cpu than on gpu using OpenCL, I was reminded again that there are problems which look like they're made for OpenCL*, but where OpenCL provides no gain in practice. E.g. I also have a kmeans implementation using pyopencl which is several times faster than simple Python code, but still several times slower than the scipy function for kmeans.

So how do you decide when to use OpenCL?

  • What graphics card do you need? How much 'better than the CPU' does the graphics card have to be? Is a Quadro FX 580 vs. an i7 860 enough?
  • How big does the problem have to be? Do you need millions of multiplications to gain something or are several hundreds enough?
  • How much optimizing of an even 'simple' algorithm like kmeans or the dot product is necessary to make OpenCL worthwhile?

Or is it one of these triangle cases, where you only can (/have to) choose two of the three corners to make it work?

    problem size
        /\
       /  \
      /    \
     /      \
    /________\
GPU/CPU   optimization

I know that the title and the questions are worded a bit too boldly. I'll change them if I can think of more suitable wording.

Thanks.

* simple matrix operations like the dot product, kmeans, or matrix multiplication
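One way to approach the problem-size question is to baseline the CPU first, before reaching for OpenCL at all. A minimal sketch (NumPy assumed; the sizes and repeat counts are arbitrary choices, not recommendations):

```python
import timeit
import numpy as np

def time_dot(n, repeats=5):
    """Time a dot product of two length-n vectors on the CPU (NumPy/BLAS).

    Returns the best-of-`repeats` time in seconds for a single dot product.
    """
    a = np.random.rand(n)
    b = np.random.rand(n)
    # Run the dot product 10 times per measurement and keep the best run.
    return min(timeit.repeat(lambda: a.dot(b), number=10, repeat=repeats)) / 10

if __name__ == "__main__":
    for n in (10**3, 10**5, 10**7):
        print(f"n={n:>9}: {time_dot(n) * 1e6:10.2f} µs per dot product")
```

Whatever an OpenCL version achieves then has to beat these numbers *including* its transfer overhead, which gives a concrete target for the "how big does the problem have to be" question.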

Framester

5 Answers


The real key is whether your algorithm has a lot of inherent parallelism, so that you can hand over a data set and have a significant amount of parallel processing happen on it. Remember, a GPU may have many cores, but each is clocked at only 0.5-1 GHz. The strength is in processing large amounts of parallel operations to get extremely high throughput.

Consider throughput roughly as (data computed × frequency × pipeline stages): there's a tradeoff in going with, say, 1/6th the frequency on one of those GPU cores, but probably more than 6× the number of cores (pipeline stages) to make up for it.

Of course there's the additional overhead of the CPU <-> GPU barrier, and your algorithm could also take multiple GPU clock cycles to compute.
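That throughput tradeoff can be sketched numerically. The ratios below (1/6th frequency, 6× or more cores) come from the answer; the absolute figures are made-up placeholders, not measurements:

```python
def relative_throughput(freq_ratio, core_ratio):
    """Rough relative throughput of a GPU vs. a CPU under the simple
    model throughput ~ frequency * parallel units.

    freq_ratio: GPU core frequency / CPU core frequency (e.g. 1/6)
    core_ratio: GPU core count / CPU core count (e.g. 6 or more)
    """
    return freq_ratio * core_ratio

# At exactly 1/6th the frequency and 6x the cores, the model predicts parity:
print(relative_throughput(1 / 6, 6))    # → 1.0
# With 1/6th the frequency but 60x the cores, the GPU comes out 10x ahead:
print(relative_throughput(1 / 6, 60))   # → 10.0
```

This is only the ideal case: it assumes the workload keeps every core busy, and it ignores the CPU <-> GPU transfer overhead mentioned above, which the other answers quantify.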

Nektarios

A few elements of an answer:

  • Dot product is not the best-suited operation to run on a GPU, because it is essentially a reduction, requiring synchronization between threads.
  • Any "recent" GPU will be OK: NVIDIA GTX 2xx, ATI/AMD HD5xxx or later are best suited to OpenCL use.
  • Moving data to/from the GPU is slow, typically 6 GB/s in the best case. If your data fits in the CPU cache, the CPU will probably be faster, unless the compute/IO ratio of the task is large.
  • Efficient code for simple algorithms can be found in AMD/NVIDIA code samples, and in various websites. For other algorithms, finding a correct design and optimizing the code can take some time. After some point, optimizations are specific to each micro-architecture, and require even more time.
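The transfer-cost point can be made concrete with a back-of-envelope model. The 6 GB/s bus figure is from the bullet above; the CPU and GPU GFLOP/s figures are illustrative assumptions you would replace with your own hardware's numbers:

```python
def gpu_worthwhile(n_bytes, flops, bus_gbps=6.0, cpu_gflops=10.0, gpu_gflops=100.0):
    """Rough model: is (transfer + GPU compute) faster than CPU compute?

    The throughput defaults are placeholder assumptions, not measurements.
    """
    transfer_s = n_bytes / (bus_gbps * 1e9)           # move data over the bus
    gpu_s = transfer_s + flops / (gpu_gflops * 1e9)   # transfer + GPU compute
    cpu_s = flops / (cpu_gflops * 1e9)                # data already in RAM/cache
    return gpu_s < cpu_s

n = 1_000_000
# Dot product of two vectors of a million doubles: 16 MB in, ~2n flops.
print(gpu_worthwhile(n_bytes=2 * 8 * n, flops=2 * n))          # → False
# 1000x1000 matrix multiplication: 24 MB moved, but ~2*N^3 flops.
print(gpu_worthwhile(n_bytes=3 * 8 * n, flops=2 * 1000**3))    # → True
```

Under this model the dot product loses because the transfer alone takes longer than the CPU computation, while matrix multiplication wins because its compute/IO ratio grows with N. That is exactly the compute/IO argument in the bullet list.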
Eric Bainville
    Consider a dot product on a matrix with a million or more values: parallelization would greatly increase performance. Each individual value in the output array does not depend on the last, so each output can be computed independently; then each workgroup can be parallelized even more, by doing the multiplications in parallel and then adding up all the results. None of it really has to be synchronized: as long as it all gets done, no operation depends on the last. – Jordan LaPrise Feb 14 '18 at 20:03

Like every technology decision, the answer depends on the goal to be reached. Information about the OpenCL capabilities of GPUs can be found on the vendor pages. Pay attention: not all GPUs support OpenCL, and not all GPUs that support OpenCL support double precision. You might also think about your customers/clients, who might not have an OpenCL-capable environment.

GPGPU programming (OpenCL and CUDA) is suitable for (almost) all kinds of linear algebra problems. These problems are quite easy to parallelize and therefore fit easily onto a parallel environment like a GPU. Problems that are to go on the GPU must not be too complex and must be designed for parallelism. This really depends on your problem domain.

On the other side, you need to pay attention to some costs of OpenCL. You need to copy data from RAM to the GPU and back, which introduces delays. You should do some time measurements of different problem sizes on CPU and GPU; you will easily see where the break-even point is reached. I tried a matrix multiplication with the ATLAS library on a CPU (Opteron X64 2x2600) and a GPU (GeForce 8600GTS). The benchmark was just multiplying two NxN matrices. The break-even was roughly around N = 100. This result heavily depends on the CPU and GPU used and might be totally different on other hardware.
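A minimal harness in the spirit of that measurement might look as follows. The two matmul functions below are stand-ins (a naive loop vs. NumPy's BLAS call), since a portable OpenCL kernel can't be assumed here; substitute your real CPU and OpenCL implementations:

```python
import timeit
import numpy as np

def break_even(cpu_fn, gpu_fn, sizes, number=3):
    """Time cpu_fn(N) and gpu_fn(N) over increasing sizes and return the
    first N at which the second implementation wins, or None if it never does."""
    for n in sizes:
        cpu_t = min(timeit.repeat(lambda: cpu_fn(n), number=number, repeat=3))
        gpu_t = min(timeit.repeat(lambda: gpu_fn(n), number=number, repeat=3))
        print(f"N={n:5d}  impl1={cpu_t:.4f}s  impl2={gpu_t:.4f}s")
        if gpu_t < cpu_t:
            return n
    return None

def naive_matmul(n):
    # Element-by-element triple loop: stand-in for a slow baseline.
    a = np.random.rand(n, n); b = np.random.rand(n, n)
    c = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            c[i, j] = a[i, :] @ b[:, j]
    return c

def blas_matmul(n):
    # Stand-in for the fast implementation (swap in your OpenCL kernel here).
    a = np.random.rand(n, n); b = np.random.rand(n, n)
    return a @ b

if __name__ == "__main__":
    print("break-even N:", break_even(naive_matmul, blas_matmul, (8, 32, 128)))
```

The exact break-even N will differ on every machine, which is precisely the point of measuring rather than guessing.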

Rick-Rainer Ludwig
    Adding to what Rick has mentioned already: if the problem size is large enough, you will almost invariably get good performance out of handing computations over to the GPU. So even if a given function is slow on the GPU (compared to the CPU), you could still hide its overhead if it is part of a larger code implemented on the GPU, rather than having to bear the overhead of transfer between host (main) and device (GPU) memory. – Pavan Yalamanchili Apr 21 '11 at 23:34

GPUs are all about data processing where intensive computation takes place. You can offload the CPU by porting your computation-intensive tasks to the GPU. The results you get are up to you, since the GPU is only a tool; it requires 'correct' use.

Lu4

In addition to the other answers, another point is which other processes are being executed at the same time.

For example, say I have a job A and a job B (data parallel).

Case 1: While A is executed, CPU usage is 3% and GPU usage is 0% (generally the case for me while I am using my computer with everyday programs). In this case, there is no need to put B on the GPU (of course, executing B on the GPU via OpenCL might still be beneficial according to the other parameters stated in the previous answers).

Case 2: However, when A is a serial process that drives CPU usage much higher than in the previous case, I think putting B on the GPU is worthwhile even if B executes more slowly on the GPU than on the CPU.

I think this should also be a parameter when deciding whether to use OpenCL.