
I have a GeForce GTX 580, and I want to make a statement about the total number of threads that can (ideally) actually run in parallel, to compare with 2- or 4-core CPUs.

deviceQuery gives me the following possibly relevant information:

CUDA Capability Major/Minor version number:    2.0
(16) Multiprocessors x (32) CUDA Cores/MP:     512 CUDA Cores
Maximum number of threads per block:           1024

I think I heard that each CUDA core can run a warp in parallel, and that a warp is 32 threads. Would it be correct to say that the card can run 512*32 = 16384 threads in parallel then, or am I way off and the CUDA cores are somehow not really running in parallel?

Eskil
  • To expand on what @CygnusX1 said, remember that SIMD is 128 (and now 256) bits wide. So for single precision, we could say that 1 CPU core looks like 8 GPU cores, making a 10-core CPU look like an 80-core GPU. Note that Hyperthreading does not enjoy SIMD on both threads. Next, we have to consider the clock speed and work-per-clock advantage of the CPU core. So the only way to measure relative performance is with a workload. – IamIC Feb 12 '12 at 09:21
    http://gamedev.stackexchange.com/questions/17243/how-many-parallel-units-does-a-gpu-have – Ciro Santilli OurBigBook.com Mar 31 '16 at 09:38

3 Answers


The GTX 580 can have 16 * 48 concurrent warps (32 threads each) running at a time. That is 16 multiprocessors (SMs) * 48 resident warps per SM * 32 threads per warp = 24,576 threads.

Don't confuse concurrency and throughput. The number above is the maximum number of threads whose resources can be stored on-chip simultaneously -- the number that can be resident. In CUDA terms we also call this maximum occupancy. The hardware switches between warps constantly to help cover or "hide" the (large) latency of memory accesses as well as the (small) latency of arithmetic pipelines.
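If you don't want to hard-code GTX 580's numbers, the runtime API reports these limits directly. A minimal sketch, assuming a CUDA toolkit recent enough that cudaDeviceProp exposes maxThreadsPerMultiProcessor:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0

        // Maximum number of threads that can be resident on-chip at once.
        // For GTX 580 (compute capability 2.0): 16 SMs * 1536 = 24,576.
        int maxResident = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;

        printf("SMs: %d, warp size: %d, max resident threads: %d\n",
               prop.multiProcessorCount, prop.warpSize, maxResident);
        return 0;
    }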

While each SM can have 48 resident warps, it can only issue instructions from a small number of warps at each clock cycle (on average between 1 and 2 for GTX 580, depending on the program's instruction mix).

So you are probably better off comparing throughput, which is determined by the available execution units and by the hardware's ability to multi-issue. On GTX 580, there are 512 FMA execution units, but also integer units, special function units, memory instruction units, etc., to which instructions can be dual-issued (i.e. independent instructions from 2 warps issued simultaneously) in various combinations.

Taking into account all of the above is too difficult, though, so most people compare on two metrics:

  1. Peak GFLOP/s, which for GTX 580 is 512 FMA units * 2 flops per FMA * 1544e6 cycles/second = 1581.1 GFLOP/s (single precision); see the short calculation sketched after this list.
  2. Measured throughput on the application you are interested in.
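The peak-GFLOP/s arithmetic from item 1 written out as a short program; a sketch, assuming 32 cores per SM (the compute-capability-2.0 figure, which the runtime does not report) and that prop.clockRate returns the shader clock in kHz, as deviceQuery shows:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // 32 CUDA cores per SM is hard-coded: it is specific to compute
        // capability 2.0 and is not reported by the runtime.
        const int coresPerSM = 32;
        int cudaCores = prop.multiProcessorCount * coresPerSM;       // 512 on GTX 580

        // prop.clockRate is in kHz; an FMA counts as 2 flops.
        double peakGflops = cudaCores * 2.0 * (prop.clockRate * 1e3) / 1e9;
        printf("Peak single-precision: %.1f GFLOP/s\n", peakGflops); // ~1581 on GTX 580
        return 0;
    }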

The most important comparison is always measured wall-clock time on a real application.

harrism
    Thanks. Why is the number of CUDA Cores (512) not the same as the number of concurrent warps (16*48 = 768) ? It would make more sense if it was 512 Cuda Cores * 48 threads per warp = 24576 threads. You sure it's not 48 threads per warp? – Eskil Jun 27 '11 at 09:34
  • There are 32 threads per warp. That is a constant across all CUDA cards as of now. – Pavan Yalamanchili Jun 27 '11 at 15:24
  • @Eskil, yes I'm positive. You need to be careful because I think you are confusing concurrency and throughput. I've updated my answer... – harrism Jun 27 '11 at 22:30
  • Approximately how many times (orders of magnitude?) speedup would you get if you transferred something that could be parallelized, say e.g. numerically solving a partial differential equation, from running on a single CPU thread (i.e. not parallelized at all) in a program written in a low-level language to running on a GTX 580 with CUDA or OpenCL code? – HelloGoodbye Jul 05 '16 at 23:08
  • @HelloGoodbye That depends a lot on the CPU and especially the problem and your existing implementation. I know, annoyingly vague answer, but remember that CPUs and GPUs are very different. In the worst case, if your problem is very memory-intensive or branching, and you have already implemented it with vector operations (SIMD), then you will experience even worse performance. – Rasmus Damgaard Nielsen Jul 18 '16 at 11:25

There are certain traps that you can fall into by doing that comparison to 2 or 4-core CPUs:

  • The number of concurrent threads does not match the number of threads that actually run in parallel. Of course you can launch 24576 threads concurrently on GTX 580 but the optimal value is in most cases lower.

  • A 2 or 4-core CPU can have arbitrarily many concurrent threads! As with the GPU, beyond some point adding more threads won't help, and may even slow things down.

  • A "CUDA core" is a single scalar processing unit, while CPU core is usually a bigger thing, containing for example a 4-wide SIMD unit. To compare apples-to-apples, you should multiply the number of advertised CPU cores by 4 to match what NVIDIA calls a core.

  • CPUs support hyperthreading, which allows a single core to process 2 threads concurrently in a lightweight way. Because of that, an operating system may actually see 2 times more "logical cores" than hardware cores.

To sum it up: For a fair comparison, your 4-core CPU can actually run 32 "scalar threads" concurrently, because of SIMD and hyperthreading.
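A back-of-the-envelope version of that comparison, using the illustrative numbers from this answer (4 cores, 4-wide single-precision SIMD such as SSE, 2 hyperthreads per core; none of these are queried from real hardware):

    #include <cstdio>

    int main() {
        // Illustrative numbers from this answer, not queried from hardware.
        const int cpuCores       = 4;  // physical cores
        const int simdWidth      = 4;  // 4-wide single-precision SIMD (e.g. SSE)
        const int threadsPerCore = 2;  // hyperthreading

        // 4 * 4 * 2 = 32 "scalar threads"
        printf("CPU scalar threads: %d\n", cpuCores * simdWidth * threadsPerCore);
        return 0;
    }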

CygnusX1
  • I remembered the value 4, but now, when I checked it, it seems you are right. I stand corrected. – CygnusX1 Jun 27 '11 at 19:08
  • @CygnusX1, saying that a CPU can have arbitrarily many concurrent threads is not a fair comparison to the GPU occupancy computation of 24,576 threads. The reason is that the GPU has enough resources on-chip to have 24,576 threads actually resident simultaneously. That means it can switch between those resident warps without moving any data off- or on-chip. CPUs have much more limited resources on-chip; therefore while they may support an arbitrary number of "concurrent" threads, those threads are not all resident on-chip; more than 2 per core requires moving context in and out of registers. – harrism Jun 27 '11 at 22:47
  • I agree that extra threads require moving context in and out of registers, but it may still land in a local L1 or L2 cache (I believe those are now on-chip right?). I don't know how many threads can be kept there, but certainly more than 2 - if they are small. I agree however, that all those threads cannot be managed by the hardware, that's why I later talk about SIMD and hyperthreading. – CygnusX1 Jun 28 '11 at 05:57
  • So 24576 is just the number of threads "stored" on the chip, and not the number of threads actually run at the same time. But how many can actually run at the same time? Or is the point that the answer to that question would just be misleading when considering performance? – Eskil Jun 28 '11 at 07:55
  • The number of threads running in parallel matches the number of CUDA cores (512 in your case). However, during long memory access (e.g. global memory access, which can take hundreds of clock cycles), new threads are assigned to the same cores by the hardware. That's why it is usually useful to actually launch more threads than cores. – CygnusX1 Jun 28 '11 at 07:59
    Actually, even that's not quite right (which is why answering this question is so hard). Newer GPUs have increasing amounts of multi-issue -- a single multiprocessor can issue instructions from multiple warps simultaneously. For example, the SM in GTX580 can issue 2 16-wide math operations, a memory LD/ST, and a tex op in 1 cycle. So it is possible to execute up to 2x as many instructions as there are SPs. But in practice it is not common to sustain an IPC > 2 on Fermi. I would still argue that since the hardware constantly switches all resident threads, all of those threads are "running". – harrism Jun 29 '11 at 22:54
  • This answer isn't quite right - To compare a CPU to a GPU you simply need to look at the max FLOPS for the data type and operation in question. Single-precision, double-precision, and integer operation performance can vary widely on different architectures (e.g. AVX is an 8-wide SIMD unit, SSE is 4-wide). If you want to compare apples to apples, look at peak FLOPS. Comparing thread counts between CPU and GPU is apples to oranges. – Robear Mar 01 '17 at 17:02
  • The question is about how *thread* count on one device can be compared to another. My comparison is more fair than the one given in the question. But if you argue that I am comparing apples to oranges, I will say you compare apples to bananas. There is no 100% fair comparison. Comparing FLOPs can be misleading as well: it ignores memory access costs, it ignores the SIMD width (the wider it is, the more threads tend to stay idle during branches), etc. – CygnusX1 Mar 01 '17 at 20:57
  • @Nik-Lz That depends on the vector instructions supported by your processor. Typical instruction sets are SSE (128-bit), AVX2 (256-bit) and AVX512 (512-bit). Assuming you work with 32-bit floats, that's 4-wide/8-wide/16-wide instructions. Intel's spec for the i7 3770 shows SSE and AVX support but not AVX2. – CygnusX1 Sep 23 '17 at 20:34

I realize this is a bit late, but I figured I'd help out anyway. From page 10 of the CUDA Fermi architecture whitepaper:

Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently.

To me this means that each SM can have 2*32=64 threads running concurrently. I don't know if that means that the GPU can have a total of 16*64=1024 threads running concurrently.
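Written out, that back-of-the-envelope arithmetic looks like the following sketch (the 2-scheduler and 32-thread-warp figures come from the whitepaper quote above; whether "issued" means "running" is exactly what the comments on the other answers debate):

    #include <cstdio>

    int main() {
        const int smCount        = 16;  // GTX 580
        const int warpSchedulers = 2;   // per SM, per the Fermi whitepaper
        const int warpSize       = 32;

        int issuedPerSM = warpSchedulers * warpSize;  // 64 threads' worth per SM
        printf("Threads issued per cycle, whole GPU: %d\n", smCount * issuedPerSM);  // 1024
        return 0;
    }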

Mitch
  • As for GTX 580, each SM could have 48 resident warps. If resources allow that maximum number of resident warps, do 2 warp schedulers and 2 instruction dispatch units imply there are always 46 warps waiting for instruction issue on each cycle? – Thomson Sep 20 '19 at 17:57
  • @Thomson, I'm not sure how you arrived at 46 warps. – Mitch Sep 23 '19 at 16:26
  • There are 48 resident warps in one SM, and 2 warps selected to run, so the remaining 46 warps are either stalled or eligible to run, but not running on hardware? – Thomson Sep 23 '19 at 18:51
  • @Thomson, I have no idea what idle warps are up to while others are executing, you might look further into the whitepaper I linked to, or try to see if someone at NVIDIA will discuss it with you. – Mitch Sep 23 '19 at 19:52