I am trying to run a program on my GPU. To start with something easy, I modified the first sample on http://www.jocl.org/samples/samples.html to run the following little script: I run n simultaneous "threads" (what's the correct name for the GPU equivalent of a thread?), each of which performs 20000000/n independent tanh() computations. You can see my code here: http://pastebin.com/DY2pdJzL
The speed is far from what I expected:
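To make the setup concrete without following the link, here is a CPU-only sketch in plain Java of how the work is split. It is not the actual kernel from the pastebin: the class name `WorkSplit`, the method names and the input values fed to tanh() are made up for illustration, and it assumes each "thread" simply loops over its own chunk.

```java
public class WorkSplit {
    // Number of tanh() evaluations handled by each of the n "threads".
    // Integer division: if n does not divide total, the remainder is dropped.
    static int chunkSize(int total, int n) {
        return total / n;
    }

    // CPU stand-in for one "thread": 'count' independent tanh() calls.
    // The argument passed to tanh() is arbitrary (assumption for illustration).
    static double runThread(int id, int count) {
        double acc = 0.0;
        for (int i = 0; i < count; i++) {
            double x = ((double) id * count + i) * 1e-7;
            acc += Math.tanh(x);
        }
        return acc;
    }

    public static void main(String[] args) {
        int total = 20000000;
        int n = 4;
        int count = chunkSize(total, n);
        double sum = 0.0;
        // On the GPU these n chunks would be launched together;
        // here they run one after another, just to show the partitioning.
        for (int id = 0; id < n; id++) {
            sum += runThread(id, count);
        }
        System.out.println("n=" + n + ", evaluations per thread=" + count + ", checksum=" + sum);
    }
}
```

With this split, the total amount of work is (almost) the same for every n, so any change in run time should come from how many chunks actually execute at the same time.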
- for n=1 it takes 12.2 seconds
- for n=2 it takes 6.3 seconds
- for n=3 it takes 4.4 seconds
- for n=4 it takes 3.4 seconds
- for n=5 it takes 3.1 seconds
- for n=6 and beyond, it takes 2.7 seconds.
So beyond n=6 (be it n=8, n=20, n=100, n=1000 or n=100000), there is no further performance increase, which suggests that only about 6 of these are computed in parallel. However, according to the specifications of my card there should be 80 cores: http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5450-overview/pages/hd-5450-overview.aspx#2
It is not a matter of overhead, since increasing or decreasing the 20000000 only scales all the execution times by a linear factor.
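To put the timings above in perspective, the following small Java sketch computes the speedup relative to n=1 and the parallel efficiency (speedup divided by n). The numbers are the measurements listed above; the class name `SpeedupTable` and the method names are just for illustration. Note that the plateau at 2.7 seconds works out to a speedup of roughly 4.5 over n=1, nowhere near 80.

```java
public class SpeedupTable {
    // Measured run times in seconds; index i holds the time for n = i + 1.
    static final double[] SECONDS = {12.2, 6.3, 4.4, 3.4, 3.1, 2.7};

    // Speedup relative to the n=1 run.
    static double speedup(double baselineSeconds, double seconds) {
        return baselineSeconds / seconds;
    }

    // Parallel efficiency: 1.0 would mean perfect linear scaling.
    static double efficiency(double baselineSeconds, double seconds, int n) {
        return speedup(baselineSeconds, seconds) / n;
    }

    public static void main(String[] args) {
        double baseline = SECONDS[0];
        for (int i = 0; i < SECONDS.length; i++) {
            int n = i + 1;
            System.out.printf("n=%d  time=%.1fs  speedup=%.2f  efficiency=%.2f%n",
                    n, SECONDS[i],
                    speedup(baseline, SECONDS[i]),
                    efficiency(baseline, SECONDS[i], n));
        }
    }
}
```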
I have installed the AMD APP SDK and drivers that support OpenCL: see http://dl.dropbox.com/u/3060536/prtscr.png and http://dl.dropbox.com/u/3060536/prtsrc2.png for details (or at least I conclude from these that OpenCL is running correctly).
So I'm a bit clueless now about where to look for an answer. Why can JOCL only do 6 parallel executions on my ATI Radeon HD 5450?