Java/OpenCL/Aparapi: What kind of performance to expect from which device?

Question

To get a rough feeling for how much OpenCL is going to help me, I ran a matrix-matrix multiplication test, since this kind of basic linear algebra will be my primary use. The code I used can be found here: http://vasanthexperiments.wordpress.com/2011/11/20/aparapi-java-matrix-multiplication-example/. (The test multiplies two 1024x1024 matrices.)

Basically, I was quite disappointed by the results: the speedup over serial computation on the CPU was only marginal (less than 2x), and when I made Aparapi use the CPU (which it does in parallel), the CPU was even faster.
During execution the graphics card is under full load, so I think there should be no communication issues.

My hardware config.:
i7 2670QM
AMD 7610M
16GB RAM

Since I'm completely new to GPGPUs, I don't know what to expect.
1. Is it likely that my setup is somehow screwed up? If so, where should I look?
2. Or am I simply expecting too much from an entry-level graphics card? If so, how do different graphics card models scale with this kind of problem? Which specs should I look at if I wanted faster hardware?

EDIT:

OK, so I just reran the program with a 10x10 matrix.
Unsurprisingly, the CPU needed less than 1 ms.
However, the GPU needs more than 1600 ms, so there is definitely something wrong with either Aparapi, OpenCL, or my hardware (drivers should be up to date). Does anyone have an idea where I should look?

score 1 · Answer 1 · answered Oct 11 '13 at 20:18

Part of the problem with your comparison is that you are comparing a low-end mobile GPU to a good mobile CPU. The single-precision speed of your GPU is roughly 2x that of your CPU, and their memory bandwidths are similar. Those are the two specifications you want to look at closely.

Last time I checked linear algebra routines, they were able to get about 60% of the peak floating point speed of a GPU. Speeds of all the current AMD and Nvidia GPUs are listed on Wikipedia here and here. You will also want to go with newer GPUs rather than older ones.


chippies


Thanks. I guess few people have actually tried using such low-end cards for GPGPU. I assumed that the GPU with its 400 processors would still be noticeably faster than an 8-threaded CPU. I guess I will have to look for some way to get access to some serious hardware. – John Smith Oct 12 '13 at 10:18
I just noticed that even for 10x10, the GPU needs 1.6 s as opposed to 0 ms on the CPU. It seems the issue is largely unrelated to the actual GPU performance. – John Smith Oct 12 '13 at 10:41
When comparing against a CPU, remember that new Intel CPUs support vector operations that operate on 8 32-bit floating point numbers in one step. So 4 cores * 8 floats = 32 floating point operations per step. Each of the 400 cores you use does only one floating point operation per step (2 if using fused multiply-add). The 1.6 seconds sounds like a combination of the time taken to generate the OpenCL code, build it, and some startup time for the GPU. The first OpenCL kernel call is always slower because the GPU needs to increase its clock speeds; it runs at lower speeds when no OpenCL or 3D apps are running. – chippies Oct 13 '13 at 13:40
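
The per-step counts in the last comment can be turned into a rough peak-throughput estimate. The following is only a sketch: the lane counts (4 cores * 8 floats, 400 GPU cores) come from the comment, while the clock speeds and the `PeakFlops` class name are assumptions made for illustration and should be checked against the actual spec sheets.

```java
// Rough peak single-precision throughput: lanes * ops per lane per cycle * clock.
final class PeakFlops {
    /** Peak GFLOP/s = parallel lanes x operations per lane per cycle x clock in GHz. */
    static double peakGflops(int lanes, int opsPerLanePerCycle, double clockGhz) {
        return lanes * opsPerLanePerCycle * clockGhz;
    }

    public static void main(String[] args) {
        // Assumed clock speeds (not stated in the thread); replace with real values.
        double cpuClockGhz = 2.2;
        double gpuClockGhz = 0.45;

        int cpuLanes = 4 * 8; // 4 cores, 8 floats per vector operation (from the comment)
        int gpuLanes = 400;   // 400 GPU cores (from the comment)

        double cpu = peakGflops(cpuLanes, 1, cpuClockGhz);
        double gpu = peakGflops(gpuLanes, 1, gpuClockGhz);
        double gpuFma = peakGflops(gpuLanes, 2, gpuClockGhz); // fused multiply-add counts as 2

        System.out.printf("CPU peak:          %.1f GFLOP/s%n", cpu);
        System.out.printf("GPU peak:          %.1f GFLOP/s%n", gpu);
        System.out.printf("GPU peak with FMA: %.1f GFLOP/s%n", gpuFma);
    }
}
```

These are theoretical ceilings only; as the answer notes, real linear algebra routines reach only a fraction of peak.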

score 1 · Answer 2 · answered Oct 12 '13 at 04:39

I tested the C language version of the example code using an AMD HD 7850 and an Intel Core i7-2600K. For the 1024x1024 case, the HD 7850 GPU takes 42 ms while the single-threaded CPU function takes nearly 7 seconds.

For 128x128, the HD 7850 GPU takes 4.9 ms while the single-threaded CPU function takes only 2.0 ms.

So for cases where the OpenCL algorithm can produce enough parallelism to fully load the GPU, the HD 7850 GPU is much faster than a single CPU thread. Even if all CPU threads were used, the GPU would still be faster for large matrices.
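
To compare these timings with peak specifications, they can be converted to achieved throughput. This sketch assumes the naive algorithm, which performs about 2*n^3 floating point operations for an n x n product; the `AchievedFlops` class is hypothetical, and the 7000 ms figure stands in for "nearly 7 seconds".

```java
// Convert a measured matrix-product time into achieved GFLOP/s.
final class AchievedFlops {
    /** Naive n x n product: about n^3 multiplies plus n^3 adds, i.e. roughly 2*n^3 operations. */
    static double gflops(int n, double millis) {
        double ops = 2.0 * n * n * n;
        return ops / (millis * 1e-3) / 1e9;
    }

    public static void main(String[] args) {
        // Timings quoted in the answer above; 7000 ms approximates "nearly 7 seconds".
        System.out.printf("1024x1024 GPU: %.2f GFLOP/s%n", gflops(1024, 42.0));
        System.out.printf("1024x1024 CPU: %.2f GFLOP/s%n", gflops(1024, 7000.0));
        System.out.printf("128x128   GPU: %.2f GFLOP/s%n", gflops(128, 4.9));
        System.out.printf("128x128   CPU: %.2f GFLOP/s%n", gflops(128, 2.0));
    }
}
```

At 128x128 the work is small enough that fixed per-call costs dominate on the GPU, which is consistent with the CPU winning there.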


That's really impressive. I get roughly 9 s with the mobile i7 processor. I guess it's really due to my card being low-end. Thanks for the feedback. – John Smith Oct 12 '13 at 10:14
I just found this benchmark: http://gpuboss.com/gpus/Radeon-HD-7850-vs-Radeon-HD-7610M. AFAIK the benchmark programs used should return a linear score, so the 7610M should perform much closer to the 7850. It seems there are some major issues with my setup, as it takes 1.6 s to compute even 10x10 on the GPU (0 ms on the CPU). – John Smith Oct 12 '13 at 10:40
My results for GPU matrix multiplication time only include the `clEnqueueNDRangeKernel` and `clEnqueueReadBuffer` statements. The `clBuildProgram` step takes considerable time, but I left it out because it is not part of the calculation. Could this be a factor? – Oct 12 '13 at 13:32
Since I'm using Aparapi, I don't have much direct access to the internals of OpenCL. I can, however, request the conversion time, which it indeed reports as 1.6 s regardless of matrix size. However, at this point I'm quite clueless as to what the card might be doing during the remaining time (~7 s). – John Smith Oct 12 '13 at 14:59
Are array sending and receiving over PCIe included in those timings? I suspect the actual calculation is 10x faster: 19 ms sending + 4 ms calculation + 19 ms receiving? – huseyin tugrul buyukisik Nov 16 '13 at 18:55
Also, a single execution is not optimized properly by the JIT, so 4-5 more repetitions with more matrices should give the real performance. – huseyin tugrul buyukisik Nov 16 '13 at 18:58
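
The advice in these comments (leave one-time setup out of the measurement and repeat the run several times) can be sketched as a small timing harness. This is a plain-Java stand-in, not Aparapi code: the `WarmupBenchmark` class and its methods are hypothetical, and the naive `multiply` would be replaced by the actual kernel call. With Aparapi the first run would additionally include the conversion and build time discussed above.

```java
import java.util.Arrays;

// Time several repetitions and report the first run separately from the rest.
final class WarmupBenchmark {
    /** Naive product c = a * b for n x n matrices stored row-major in flat arrays. */
    static float[] multiply(float[] a, float[] b, int n) {
        float[] c = new float[n * n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                float aik = a[i * n + k];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
        return c;
    }

    /** Runs the product `runs` times and returns each run's wall-clock time in milliseconds. */
    static double[] timeRuns(int n, int runs) {
        float[] a = new float[n * n];
        float[] b = new float[n * n];
        Arrays.fill(a, 1f);
        Arrays.fill(b, 2f);
        double[] ms = new double[runs];
        float sink = 0f;
        for (int r = 0; r < runs; r++) {
            long t0 = System.nanoTime();
            float[] c = multiply(a, b, n);
            ms[r] = (System.nanoTime() - t0) / 1e6;
            sink += c[0]; // use the result so the work cannot be optimized away
        }
        if (sink < 0) System.out.println(sink);
        return ms;
    }

    public static void main(String[] args) {
        double[] ms = timeRuns(256, 6);
        double rest = Arrays.stream(ms, 1, ms.length).average().orElse(Double.NaN);
        System.out.printf("first run: %.2f ms, mean of later runs: %.2f ms%n", ms[0], rest);
    }
}
```

Reporting the first run separately makes one-time costs visible instead of letting them distort the steady-state figure.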
