0

How do I programmatically find the maximum number of concurrent CUDA threads or streaming multiprocessors on a device / NVIDIA graphics card? I know about `warpSize`, but there is no `warpCount`.

Most answers on the internet concern themselves with looking things up in PDFs.

Community
guest
  • The question you linked contains no references to PDFs (that I can see) and mentions `deviceQuery`, a programmatic tool/sample code that retrieves essentially all of the machine-readable information about a CUDA GPU. It will specifically answer how many streaming multiprocessors there are, and the number of "concurrent threads" is obtained algorithmically from that data, based on exactly what you mean by "concurrent threads". The question you linked points out the difficulty in quantifying imprecise ideas like "concurrent threads", partly because GPU execution units are pipelined. – Robert Crovella Sep 04 '14 at 16:13
  • Sorry, I was looking for a quick reference. This is way too verbose. If this is the wrong place for this format then I guess I will have to delete this question. – guest Sep 04 '14 at 18:34
  • If "concurrent" is the maximum number of threads that can be allocated to physical resources at a given time, then the answer is `cudaDeviceProp.multiProcessorCount * cudaDeviceProp.maxThreadsPerMultiProcessor`. See [cudaDeviceProp](http://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp). – Greg Smith Sep 05 '14 at 22:39
  • @GregSmith Maybe you can help me word this question better: I would like to know where to find the parameters that become relevant in the automatic tuning of a kernel and its memory allocation, so that all warps are fully utilized at all times and distant cache grabs are minimized. – guest Sep 06 '14 at 02:55
  • Boost uses the term "[hardware threads](http://www.boost.org/doc/libs/1_56_0/doc/html/thread/thread_management.html#thread.thread_management.thread.hardware_concurrency)". – Jean Davy Oct 01 '14 at 14:50
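
For a quick reference, the formula from Greg Smith's comment can be sketched as a small host-side program. This is a minimal sketch, not a replacement for `deviceQuery`: the helper name `maxResidentThreads` is made up for illustration, and "concurrent" is assumed to mean "resident on the hardware at once", which is an upper bound that a real kernel may not reach.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on resident threads for one device,
// computed as SM count times the per-SM resident thread limit.
static long long maxResidentThreads(int smCount, int maxThreadsPerSm) {
    return static_cast<long long>(smCount) * maxThreadsPerSm;
}

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }

        // There is no warpCount field, but the per-SM warp limit follows
        // from the per-SM thread limit and the warp size.
        int warpsPerSm = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        long long threads = maxResidentThreads(prop.multiProcessorCount,
                                               prop.maxThreadsPerMultiProcessor);

        std::printf("Device %d (%s): %d SMs, warp size %d, "
                    "up to %d warps per SM, up to %lld resident threads\n",
                    dev, prop.name, prop.multiProcessorCount, prop.warpSize,
                    warpsPerSm, threads);
    }
    return 0;
}
```

Compile with something like `nvcc query.cu -o query` (the file name is arbitrary). The numbers printed are hardware limits only; they say nothing about what a particular kernel will actually achieve.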

2 Answers

2

Have you tried checking the CUDA SDK samples? I think the sample you want is Device Query (`deviceQuery`).

Kiloreux
1

This depends not only on the device but also on your code, e.g. the number of registers each thread uses or the amount of shared memory each block needs. I would suggest reading about occupancy.
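
As a rough illustration of how the code enters the picture, the runtime can report how many blocks of a given kernel fit on one SM via `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (available in CUDA 6.5 and later). The sketch below is only an example under assumptions: `scaleKernel`, the block size of 256, and the helper `occupancyFraction` are placeholders chosen for illustration, not anything from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so the occupancy API has something to inspect.
__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: fraction of the per-SM thread limit that is occupied
// when blocksPerSm blocks of blockSize threads are resident.
static double occupancyFraction(int blocksPerSm, int blockSize, int maxThreadsPerSm) {
    return static_cast<double>(blocksPerSm) * blockSize / maxThreadsPerSm;
}

int main() {
    const int dev = 0;                 // assumption: query the first device
    const int blockSize = 256;         // assumption: example launch configuration
    const size_t dynamicSmemBytes = 0; // no dynamic shared memory in this example

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Takes the kernel's register and static shared memory usage into account.
    int blocksPerSm = 0;
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, scaleKernel, blockSize, dynamicSmemBytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "occupancy query failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    long long residentThreads =
        static_cast<long long>(blocksPerSm) * blockSize * prop.multiProcessorCount;
    std::printf("%d blocks of %d threads per SM, %lld resident threads on the "
                "device, occupancy %.2f\n",
                blocksPerSm, blockSize, residentThreads,
                occupancyFraction(blocksPerSm, blockSize,
                                  prop.maxThreadsPerMultiProcessor));
    return 0;
}
```

A kernel that uses more registers or shared memory per block would typically report fewer blocks per SM here, which is why the device limits alone do not answer the question.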

Another thing I would note is that if your code relies on having a certain number of threads resident on the device (e.g. if you wait for several threads to reach some execution point), you are bound to run into race conditions and see your code hang.

Eugene