This is a follow-up question to this one.
Suppose I have a CUDA kernel:

template <unsigned ThreadsPerWarp>
__global__ void foo(bar_t* a, const baz_t* b);
and I'm implementing a specialization of it for the case where ThreadsPerWarp is 32 (this circumvents the valid criticism in Talonmies' answer to my previous question).
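Concretely, the specialization I have in mind looks roughly like this (a sketch only; the body is elided, and bar_t and baz_t are the placeholder types from above):

template <>
__global__ void foo<32>(bar_t* a, const baz_t* b)
{
    // ... body tuned for the ThreadsPerWarp == 32 case ...
}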
In the body of this function (or of other __device__ functions called from it), should I prefer using the constant value of ThreadsPerWarp, or is it better to use warpSize? Or will it all be the same to the compiler in terms of the PTX it generates?
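To illustrate, here is a minimal sketch of where the choice comes up (lane_index is a made-up helper name, not part of my actual code):

template <unsigned ThreadsPerWarp>
__device__ unsigned lane_index()
{
    // Option A: the template parameter, a genuine compile-time constant
    return threadIdx.x % ThreadsPerWarp;

    // Option B: the built-in warpSize variable
    // return threadIdx.x % warpSize;
}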