I'm seeing inconsistent behavior regarding the availability of bfloat16 operators when compiling kernel code with NVRTC on different machines. Both have the same CUDA version, 11.2, and the kernel source includes cuda_bf16.h.
On one machine, this operator from cuda_bf16.hpp:
__device__ __forceinline__ __nv_bfloat16 &
operator+=(__nv_bfloat16 &lh, const __nv_bfloat16 &rh)
{ lh = __hadd(lh, rh); return lh; }
is available; on the other machine it's missing, and my compilation fails. If I define the operator myself, that works on the machine where it's missing, but fails on the other because of a redefinition.
This is already enough to ask the question: What could cause the operator to only be available on one of the machines?
But I have what is perhaps a clue: The definition of the operators is enclosed in:
#if !defined(__CUDA_NO_BFLOAT16_OPERATORS__)
#endif /* !defined(__CUDA_NO_BFLOAT16_OPERATORS__) */
and I suspect that something, somewhere, is defining that macro. What could it be? I can't find this string anywhere in /usr/local/cuda-11.2/include other than in these two lines.
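To test this suspicion on each machine, a small preprocessor probe could go into the kernel source handed to NVRTC, after the include of cuda_bf16.h. This is only a sketch: BF16_OPERATORS_DISABLED is a hypothetical helper name I made up for illustration, not something CUDA provides, and the probe only reports whether the guard macro is visible at that point, not who defined it.

```cpp
// Probe for the guard macro that surrounds the bfloat16 operators in cuda_bf16.hpp.
// BF16_OPERATORS_DISABLED is a hypothetical helper macro, not part of CUDA.
#if defined(__CUDA_NO_BFLOAT16_OPERATORS__)
#define BF16_OPERATORS_DISABLED 1  // guard macro is visible: the header skips its operators
#else
#define BF16_OPERATORS_DISABLED 0  // guard macro is not visible: the header defines its operators
#endif

// Optional: uncomment to turn the probe into a compile error, so that the
// result shows up in the compilation log of the machine being tested.
// static_assert(BF16_OPERATORS_DISABLED == 0,
//               "__CUDA_NO_BFLOAT16_OPERATORS__ is defined for this compilation");
```

If the probe fires on exactly the machine where operator+= is missing, the macro is presumably coming from outside the CUDA headers, for example from the options passed to the compilation or from code prepended to the kernel source. The same #if check could also guard a hand-written operator+=, so that it is only compiled where the header's version is absent.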
Note: The machines' driver versions differ: one is 460.73.01 and the other is 535.54.03.