5

I had a quick glance at the CUDA Programming Guide w.r.t. the -use-fast-math optimizations, and although Appendix C mentions that divisions are converted to an intrinsic, there is no mention of multiplications. The reason I ask is that my kernel has a lot of multiplications. I am aware that NVCC will try to fuse multiplications and additions (when the regular '*' and '+' operators are used), and that intrinsics are never merged into FMAD operations. But if my code is multiplication heavy, would there be a benefit to using a round-to-nearest SP intrinsic like __fmul_rn?

So there are two questions:

  1. Does the -use-fast-math option translate multiplications using the '*' operator into SP intrinsics like __fmul_rn?

  2. Could there be a performance benefit in hand-coding multiplications to explicitly use __fmul_rn? An example or some numbers would help me understand.

Sayan

1 Answer

3

"Standalone" single precision multiplication always compiles to hardware instructions ("intrinsics"). There is no other type of floating point multiplication instructions. The -use_fast_math option in nvcc has no effect on the floating point multiplication instructions emitted for compute capability 1.x targets. On compute 2.x and 3.x targets, it puts the compiler into a compatibility mode and all single precision multiplication instructions will be mul.ftz.f32 (flush to zero).

The floating point intrinsics you mention (__fmul_{rm,rn,rp,rz,ftz,sat}) only provide explicit control over the IEEE rounding behaviour. I don't believe there is a throughput difference between any of them on Fermi or Kepler GPUs.
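As a quick check (a minimal sketch written for illustration, not part of the original answer), you can compile a pair of kernels like the following with `nvcc -ptx`, once with and once without -use_fast_math, and compare the multiplication instructions in the emitted PTX:

```cuda
// Sketch: compile with e.g. `nvcc -arch=sm_20 -ptx fmul_test.cu`
// and again with `-use_fast_math`, then inspect the PTX.

__global__ void mul_plain(const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Plain '*' operator: expect mul.f32, or mul.ftz.f32 with -use_fast_math
    c[i] = a[i] * b[i];
}

__global__ void mul_intrinsic(const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Explicit round-to-nearest intrinsic: expect mul.rn.f32,
    // which carries its rounding mode regardless of -use_fast_math
    c[i] = __fmul_rn(a[i], b[i]);
}
```

The intrinsic version pins the rounding mode in the PTX, which is exactly what makes it ineligible for the FMAD/FFMA merging discussed in the comments below.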

talonmies
  • Note that __fmul_rn() maps to a PTX instruction with a specific IEEE rounding mode. This in turn suppresses certain optimizations, in particular the merging of a single-precision multiplication followed by a single-precision addition into a multiply-add type instruction (FMAD on sm_1x, FFMA on sm_2x and sm_3x). See the PTX manual. This property can be useful when one wants to achieve specific numerical properties for some code, and it is used for this purpose at various places inside the CUDA math library, for example. – njuffa Jul 16 '12 at 17:46
  • "achieve specific numerical properties for some code, and is used for this purpose at various places inside the CUDA math library" - can you please give an example? Thank you. – Sayan Jul 16 '12 at 17:57
  • Grepping math_functions.h for __fmul_rn will lead to various worked examples. Note that the sm_1x FMAD involves a truncating multiplication. Where that leads to an unacceptable loss of accuracy, you can inhibit FMAD merging locally by use of __fmul_rn(). There is also an nvcc flag -fmad=false, but that inhibits FMAD merging for the entire compilation unit, which typically would have a significant negative impact on performance. – njuffa Jul 17 '12 at 01:10