
According to the Kepler architecture whitepaper, an SMX has 192 CUDA cores and 64 Double Precision Units (DPUs). A K20Xm has 14 SMXs, totalling 2688 cores, which means that only the CUDA cores are counted. What exactly are the DPUs used for, then, and how is their usage related to the cores?

My thoughts:

a) The CUDA cores can't do double precision operations and only the DPUs can. Therefore, the CUDA cores are free for other stuff while the DPUs are busy.

b) The CUDA cores somehow need a double precision unit to do double precision operations, therefore only 128 of the 192 CUDA cores are available for other stuff.

Cheers Andi

Vitality
user2267896
  • I'm not sure why this question has been flagged as unclear. I'm voting to reopen it. – Vitality Dec 09 '13 at 20:55
  • @Talonmies et al: I don't understand how the double precision units are used. How do they relate to certain jobs, i.e. do they block CUDA single precision cores, or are those free for other jobs? Why aren't they included in the total number of cores in the specs for the K20Xm? Or, put more crudely: what do I need to take care of if I want to use a Kepler card in the most efficient way? – user2267896 Dec 09 '13 at 21:01
  • Please go through Robert Crovella's answer. He clearly says that DPUs are _independent from the "CUDA cores"_, so I do not expect them to block CUDA single precision cores (single and double precision cores can work simultaneously). In the previous `sm=2.0` Fermi architecture, there were only `32` single precision cores and each double precision instruction consumed `2` single precision cores. For this reason, double precision instructions did not support dual dispatch with any other operation. Now, dual-issue is also possible with double precision operations. – Vitality Dec 09 '13 at 21:17
  • Ahh, I just found the source of my confusion. The K20Xm has 3 CUDA cores for each DPU, which implies that its single precision performance should be three times the double precision performance (which it is, according to their paper). But somehow I thought I had read that it would only provide a speedup of 2. Sorry for the confusion, and thanks for the answers. – user2267896 Dec 11 '13 at 14:51

1 Answer


The double precision units are separate hardware floating point units that do double precision arithmetic. They are independent from the "CUDA cores", which, roughly speaking, can be considered the single precision units.

So for single precision arithmetic, the throughput can be computed based on the "CUDA cores", i.e. the single precision units. For double precision arithmetic, the throughput must be computed based on the double precision units.

In a Kepler K20 SMX, the ratio of double precision units to single precision units is 1:3 (64 to 192). Therefore the throughput for each type of arithmetic follows the same ratio. By "arithmetic" I mean here floating point multiply or floating point add.
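As a rough illustration of that calculation, here is a minimal Python sketch. The unit counts (14 SMXs, 192 CUDA cores and 64 DP units per SMX) come from the question. The clock value and the convention of counting one fused multiply-add per unit per cycle as 2 floating point operations are assumptions added for illustration, not figures from this thread, and the helper `peak_gflops` is a hypothetical name.

```python
# Unit counts taken from the question (Kepler K20Xm).
SMX_COUNT = 14
SP_UNITS_PER_SMX = 192  # "CUDA cores", i.e. single precision units
DP_UNITS_PER_SMX = 64   # double precision units

# Assumptions, not from the thread: core clock in GHz, and one fused
# multiply-add per unit per cycle counted as 2 floating point operations.
ASSUMED_CLOCK_GHZ = 0.732
ASSUMED_FLOPS_PER_UNIT_PER_CYCLE = 2


def peak_gflops(units_per_smx, smx_count=SMX_COUNT, clock_ghz=ASSUMED_CLOCK_GHZ):
    """Theoretical peak throughput in GFLOP/s for one kind of unit."""
    return units_per_smx * smx_count * ASSUMED_FLOPS_PER_UNIT_PER_CYCLE * clock_ghz


sp_units = SMX_COUNT * SP_UNITS_PER_SMX  # the advertised "core" count
dp_units = SMX_COUNT * DP_UNITS_PER_SMX  # not included in the "core" count
sp_peak = peak_gflops(SP_UNITS_PER_SMX)
dp_peak = peak_gflops(DP_UNITS_PER_SMX)

print(f"SP units: {sp_units}, DP units: {dp_units}")
print(f"SP peak: {sp_peak:.0f} GFLOP/s, DP peak: {dp_peak:.0f} GFLOP/s")
print(f"SP/DP ratio: {sp_peak / dp_peak:.1f}")
```

Whatever clock is assumed, it cancels out of the ratio, so the 3:1 relationship between single and double precision peak throughput depends only on the unit counts.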

Robert Crovella
  • So in other words, if I have 3 single precision instructions for each double precision instruction, I will get the most "efficient" kernel (ignoring possible dependencies) – user2267896 Dec 09 '13 at 14:13
  • @user2267896 I'm not sure this comment makes sense. Typically, you work in either single or double precision arithmetic and do not mix the two. If you mix single and double precision, your final result will have less precision than double, making it meaningless to subsequently issue double precision operations. What you can say is that you can now mix (dual-issue) integer and double precision operations, unlike on Fermi, and that, as already stated by Robert Crovella, the single precision throughput is three times the double precision one. – Vitality Dec 09 '13 at 21:28
  • You cannot simultaneously get peak double precision throughput and peak single precision throughput out of any GPU. Your question now appears to be primarily about scheduling. You cannot simultaneously retire 192 SP floating point ops and 64 DP floating point ops in a Kepler SMX. – Robert Crovella Dec 10 '13 at 16:38