I have run into a strange issue while running the same OpenCL kernel on multiple machines. Please see below:
OS      OpenCL version   GPU                  Output accuracy
Linux   2.0              AMD R9 290X          Good
Mac     1.2              Nvidia GT-750M       Good
Mac     1.2              AMD Firepro D500     Incorrect
Linux   1.1              Nvidia Tesla K20     Good
I posted on the Apple forums, and the only reply I have received is that I should disable fast path math. However, I am not enabling it anywhere.
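As a sanity check on that claim, here is a minimal sketch of a helper that scans the options string handed to clBuildProgram for the standard OpenCL compiler flags that relax floating-point accuracy. The function name `has_relaxed_math_flag` is hypothetical and not part of the code linked below. The sketch assumes the options are available as a plain C string, and that the forum reply was referring to flags such as `-cl-fast-relaxed-math`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Standard OpenCL build options that relax floating-point accuracy. */
static const char *const RELAXED_MATH_FLAGS[] = {
    "-cl-fast-relaxed-math",
    "-cl-unsafe-math-optimizations",
    "-cl-finite-math-only",
    "-cl-mad-enable",
    "-cl-denorms-are-zero",
};

/* Hypothetical helper: returns true if `options` (the string passed to
   clBuildProgram) contains any accuracy-relaxing flag.
   A NULL options string is treated as "no flags set". */
bool has_relaxed_math_flag(const char *options)
{
    if (options == NULL)
        return false;

    size_t n = sizeof RELAXED_MATH_FLAGS / sizeof RELAXED_MATH_FLAGS[0];
    for (size_t i = 0; i < n; ++i) {
        if (strstr(options, RELAXED_MATH_FLAGS[i]) != NULL)
            return true;
    }
    return false;
}
```

If this returns false for every build call on the host side, the compiler options can probably be ruled out as the source of the difference, although a driver could still apply its own defaults.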
In terms of performance, the code runs twice as slow on the Firepro as on the other discrete GPUs in the list (Tesla and R9).
Can someone please tell me what could be going on? I am happy to share the code if needed.
Here is the OpenCL kernel (some of the variable/function names are not proper): http://pastebin.com/Kt4TinXt
Here is how it is called from the host:
sentence_length = 1024
num_sentences = 6
count = 0
for(sentence in textfile)
{
    sentences += sentence
    count++
    if(count == num_sentences - 1)
        enqueuekernel(sentences)
}
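For comparison, the batching pattern in the pseudocode above can be sketched in C as follows. This is not the actual host code. It assumes the intent is to launch the kernel once per `num_sentences` sentences, reset the batch afterwards, and flush any leftover sentences at the end of the file. `enqueue_batch` and `dispatch_stats` are hypothetical stand-ins that only count launches instead of calling OpenCL.

```c
#include <stddef.h>

/* Hypothetical stand-in for the real kernel launch: it only records how
   many batches were dispatched and how many sentences they carried. */
typedef struct {
    size_t launches;
    size_t sentences_sent;
} dispatch_stats;

static void enqueue_batch(dispatch_stats *stats, size_t batch_size)
{
    stats->launches += 1;
    stats->sentences_sent += batch_size;
}

/* Groups total_sentences sentences into batches of num_sentences,
   dispatching every full batch and then the final partial batch, if any. */
dispatch_stats dispatch_in_batches(size_t total_sentences, size_t num_sentences)
{
    dispatch_stats stats = {0, 0};
    size_t count = 0;

    if (num_sentences == 0)
        return stats;                    /* nothing sensible to dispatch */

    for (size_t i = 0; i < total_sentences; ++i) {
        count++;                         /* sentence i joins the batch */
        if (count == num_sentences) {    /* batch is full */
            enqueue_batch(&stats, count);
            count = 0;                   /* start a fresh batch */
        }
    }
    if (count > 0)                       /* flush leftover sentences */
        enqueue_batch(&stats, count);

    return stats;
}
```

With 13 sentences and a batch size of 6, this sketch dispatches two full batches and one final batch holding the remaining sentence.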
A sentence is basically a group of 1024 words. The level of parallelism is at the word level. I chose to use 128 work-items per word, because that allowed me to keep neu1 and neu1e in local (shared) memory. I tried other combinations, such as 'layer1_size' work-items per word or one wavefront per word, but those did not perform well at all. Even now the performance is not that great, but it gives me around a 2.8x speedup over a 6-core Xeon on the R9 and the Tesla.
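To make the launch geometry described above concrete, here is a rough sketch of the arithmetic in C. It assumes one 128-wide work-group per word, one launch covering all 6 sentences of a batch, and that neu1 and neu1e are each `float` arrays of `layer1_size` elements. None of this is confirmed by the kernel itself, and the function names are hypothetical.

```c
#include <stddef.h>

enum {
    WORDS_PER_SENTENCE  = 1024, /* sentence_length in the host pseudocode */
    SENTENCES_PER_BATCH = 6,    /* num_sentences in the host pseudocode */
    WORK_ITEMS_PER_WORD = 128   /* assumed: one work-group per word */
};

/* Total work-items for one launch under the assumptions above. */
size_t global_work_size(void)
{
    return (size_t)SENTENCES_PER_BATCH * WORDS_PER_SENTENCE * WORK_ITEMS_PER_WORD;
}

/* Number of work-groups, i.e. one per word in the batch. */
size_t work_group_count(void)
{
    return global_work_size() / WORK_ITEMS_PER_WORD;
}

/* Local memory needed per work-group, assuming neu1 and neu1e are each
   float arrays with layer1_size elements. */
size_t local_mem_bytes(size_t layer1_size)
{
    return 2 * layer1_size * sizeof(float);
}
```

Under these assumptions a single launch covers 6 * 1024 = 6144 work-groups and 786432 work-items. The local memory figure can be compared against the value reported by clGetDeviceInfo for CL_DEVICE_LOCAL_MEM_SIZE on each GPU.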
Please let me know if more detail is needed!