I had a quick look at the CUDA Programming Guide regarding the -use_fast_math optimizations. Appendix C mentions that divisions are converted to an intrinsic, but it says nothing about multiplications. The reason I ask is that my kernel has a lot of multiplications. I am aware that NVCC tries to fuse multiplications and additions into FMAD operations when the regular '*' and '+' operators are used, and that intrinsics are never merged into FMAD operations. But if my code is multiplication-heavy, would there be a benefit to using a round-to-nearest single-precision (SP) intrinsic like __fmul_rn?
So there are two questions:
Does the -use_fast_math option translate multiplications written with the '*' operator into SP intrinsics like __fmul_rn?
Could there be a performance benefit from hand-coding multiplications to explicitly use __fmul_rn? An example or some numbers would help me understand.
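
To make the second question concrete, here is a minimal sketch of the two variants I have in mind. The kernel names (madd_operators, madd_intrinsics), the input values and the launch configuration are made up for illustration and are not taken from my real kernel. It only checks whether the two versions produce different results; it does not measure timing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels, named for illustration only.

// Plain operators: nvcc may contract a * b + c into a single fused multiply-add.
__global__ void madd_operators(const float* a, const float* b, const float* c,
                               float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] * b[i] + c[i];
}

// Intrinsics: a separate round-to-nearest multiply and add,
// which the compiler does not merge into an FMAD.
__global__ void madd_intrinsics(const float* a, const float* b, const float* c,
                                float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __fadd_rn(__fmul_rn(a[i], b[i]), c[i]);
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c, *out_op, *out_in;

    // Managed memory keeps the host code short; arbitrary choice for this sketch.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    cudaMallocManaged(&out_op, bytes);
    cudaMallocManaged(&out_in, bytes);

    // Arbitrary inputs, chosen so that the product and the addend nearly cancel.
    for (int i = 0; i < n; ++i) {
        a[i] = 1.0f + i * 1e-7f;
        b[i] = 3.0f;
        c[i] = -3.0f;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    madd_operators<<<blocks, threads>>>(a, b, c, out_op, n);
    madd_intrinsics<<<blocks, threads>>>(a, b, c, out_in, n);
    cudaDeviceSynchronize();

    // Count elements where the fused and unfused results are not bit-identical.
    int differing = 0;
    for (int i = 0; i < n; ++i)
        if (out_op[i] != out_in[i]) ++differing;
    printf("elements that differ: %d of %d\n", differing, n);

    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(out_op); cudaFree(out_in);
    return 0;
}
```

I assume that dumping the SASS for both kernels (for example with cuobjdump -sass), with and without -use_fast_math, would show whether the '*' version ends up as a fused instruction, but I would like to know what to expect in terms of performance.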