I have run into a strange issue while running the same OpenCL kernel on multiple machines. Please see below:

OS       OpenCL version   GPU                 Output accuracy
Linux    2.0              AMD R9 290X         Good
Mac      1.2              Nvidia GT-750M      Good
Mac      1.2              AMD FirePro D500    Incorrect
Linux    1.1              Nvidia Tesla K20    Good

I posted on Apple forums, and the only reply I have received is that I should disable fast path math. I am not enabling it anywhere.

In terms of performance, the code runs twice as slow on the FirePro as on the other discrete GPUs in the list (the Tesla and the R9).

Can someone please tell me what could be going on? I am happy to share the code if needed.


Here is the OpenCL kernel (some of the variable/function names are not proper): http://pastebin.com/Kt4TinXt

Here is how it is called from the host:

sentence_length = 1024   // words per sentence
num_sentences = 6
count = 0
for (sentence in textfile)
{
    sentences += sentence
    count++
    if (count == num_sentences - 1)
        enqueuekernel(sentences)
}
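
For readers who want something concrete, the batching in the pseudocode above could be sketched in C roughly as follows. This is only an illustration, not the actual host code: enqueue_batch is a hypothetical stub that merely counts launches, and the sketch assumes that a batch is launched every batch_size sentences, that the counter is reset after each launch, and that a final partial batch is flushed. None of those three points is spelled out in the pseudocode.

```c
#include <stddef.h>

/* Hypothetical stand-in for the real enqueue call; it only counts launches. */
static size_t launches = 0;

static void enqueue_batch(size_t sentences_in_batch)
{
    (void)sentences_in_batch;   /* a real version would launch the kernel here */
    launches++;
}

/* Groups total_sentences into batches of batch_size sentences
 * (assumed to be non-zero) and returns the number of launches issued. */
size_t run_batches(size_t total_sentences, size_t batch_size)
{
    size_t count = 0;
    launches = 0;

    for (size_t i = 0; i < total_sentences; i++) {
        count++;
        if (count == batch_size) {   /* batch is full: launch and start over */
            enqueue_batch(count);
            count = 0;
        }
    }
    if (count > 0)                   /* flush the last, partial batch */
        enqueue_batch(count);

    return launches;
}
```

Under these assumptions, 13 sentences with a batch size of 6 produce three launches: two full batches and one batch holding the remaining sentence.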

A sentence is basically a group of 1024 words, and the parallelism is at the word level. I chose to use 128 work-items per word because that let me keep neu1 and neu1e in local (shared) memory. I tried other combinations, such as 'layer1_size' work-items per word or one wavefront per word, but those performed poorly. Even now the performance is not great, but I get around a 2.8x speedup over a 6-core Xeon on the R9 and the Tesla.
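
To make the launch geometry concrete, here is a small C sketch of the arithmetic. It assumes one work-group per word, and that neu1 and neu1e each hold layer1_size floats; both are assumptions for illustration, since the value of layer1_size and the exact buffer layout are not given here. The names launch_plan and plan_launch are hypothetical.

```c
#include <stddef.h>

typedef struct {
    size_t global_size;   /* total work-items in one launch */
    size_t local_size;    /* work-items per work-group (assumed: one group per word) */
    size_t local_bytes;   /* local memory per work-group for neu1 and neu1e */
} launch_plan;

/* Computes the sizes for one kernel launch over num_sentences sentences. */
launch_plan plan_launch(size_t sentence_length, size_t num_sentences,
                        size_t items_per_word, size_t layer1_size)
{
    launch_plan p;
    size_t words = sentence_length * num_sentences;

    p.local_size  = items_per_word;
    p.global_size = words * items_per_word;
    /* Assumption: two float arrays of layer1_size elements each. */
    p.local_bytes = 2 * layer1_size * sizeof(float);
    return p;
}
```

With the numbers above (1024 words per sentence, 6 sentences, 128 work-items per word) this works out to 6144 words and 786,432 work-items per launch.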

Please let me know if more detail is needed!

user1274878
  • The R9 290X is twice as powerful as the D500 because of its core count and clock frequencies. The Tesla's microarchitecture may also suit your algorithm better than the D500's does. – huseyin tugrul buyukisik Sep 14 '15 at 22:14
  • "Can someone please tell what could be going on? I am happy to share the code if needed." -- unless there are mind readers among us, nobody will be able to guess why one device gives you incorrect results. This is SO; tell us what you tried already. – Dithermaster Sep 14 '15 at 22:46
  • @huseyintugrulbuyukisik, as you can see from the OpenCL kernel, it is memory bound. The FirePro D500's peak memory bandwidth is 240 GB/s, while the Tesla's is 208 GB/s, so I was hoping to get better performance. – user1274878 Sep 15 '15 at 15:29
