AVX-512 offers most floating-point instructions in masked form, where a mask selects which result lanes are written (the others are left unchanged or zeroed). Do CPUs actually use the mask to decide which operations, say multiplications, get performed, or does it only control which results are written, with the masked-off lanes computed anyway and then discarded?

In practice this can be useful for processing for instance:

for (int i=0; i<27; i++) a[i] *= b[i];

27 isn't divisible by 8, so some items are left over. One can write a separate loop that processes them one by one, or use AVX if at least 4 remain and then process the rest one by one. There are many possibilities.

A good compiler would vectorize this specific loop well automatically, but it's just a stand-in for more complex cases where I want to vectorize manually and still have unused lanes, or where I know that some multiplication results are not needed.
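For concreteness, here is a minimal sketch of the masked-tail variant of the loop above. It assumes `a` and `b` are `double` arrays (8 lanes per 512-bit register, which matches the "divisible by 8" remark); the names `mul_inplace` and `tail_mask8` are hypothetical. The AVX-512 path is compiled only when `__AVX512F__` is defined, and a scalar loop is used otherwise. It shows how the mask is applied, not whether the hardware skips the masked-off lanes.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Mask with the low `rem` bits set (rem in 0..8), one bit per double lane. */
static inline uint8_t tail_mask8(size_t rem) {
    return (uint8_t)((1u << rem) - 1u);
}

/* a[i] *= b[i] for i in [0, n). Hypothetical helper, assumes double data. */
void mul_inplace(double *a, const double *b, size_t n) {
    size_t i = 0;
#if defined(__AVX512F__)
    /* Full 8-lane blocks, unmasked. */
    for (; i + 8 <= n; i += 8) {
        __m512d va = _mm512_loadu_pd(a + i);
        __m512d vb = _mm512_loadu_pd(b + i);
        _mm512_storeu_pd(a + i, _mm512_mul_pd(va, vb));
    }
    /* Remaining 1..7 elements in a single masked step. */
    if (i < n) {
        __mmask8 k = tail_mask8(n - i);
        /* Masked-off lanes are not read from memory and load as zero. */
        __m512d va = _mm512_maskz_loadu_pd(k, a + i);
        __m512d vb = _mm512_maskz_loadu_pd(k, b + i);
        /* Only lanes selected by k are written back. */
        _mm512_mask_storeu_pd(a + i, k, _mm512_mul_pd(va, vb));
        i = n;
    }
#endif
    /* Scalar fallback; a no-op when the AVX-512 path handled everything. */
    for (; i < n; ++i) a[i] *= b[i];
}
```

With `n = 27` this does three full blocks and one masked step with `k = 0x07`. The multiply itself is unmasked here; only the loads and the store carry the mask.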

Edit: Experimentally checked, and it seems that masked operations are actually slower.

Vojtěch Melda Meluzín
  • Dynamically doing fewer multiplications in a meaningful way (actually increasing throughput) would be difficult to make in hardware – harold Oct 18 '18 at 09:59
  • How are you testing the masked operations? If you're adding dependencies to the code, it can slow it down. All other things being equal masked instructions should be the same as unmasked instructions. – Mysticial Oct 18 '18 at 16:05
  • @harold The only thing I imagine is more indirect. Zeroed lanes could simply no-op the operation and consume no power for that instruction. So with enough of them, it may affect the thermals of the chip, thus affecting the turbo boosts. This can probably be tested experimentally by watching the CPU temperatures + power consumption when looping code with different #s of lanes enabled via mask. – Mysticial Oct 18 '18 at 16:08
  • Doing 3 times 8 multiplications followed by 1 times 3 multiplications will almost certainly be faster than 3 times 8, followed by 3 times 1 multiplications. If masked operations are actually slower, you can try to just mask the write, or to overwrite 5 (or 1) of the last 8 elements. Of course, this also depends on whether you know the size of your vectors at compile time. – chtz Oct 19 '18 at 11:18
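The overlapping idea in the last comment can be sketched in plain C, assuming `double` data and 8-element blocks. The name `mul_inplace_overlap` is hypothetical, and the fixed-width inner loops are only written in a shape a compiler may vectorize, so this is an illustration rather than a tuned implementation. Because the update is in place, elements covered twice must not be multiplied twice, so the sketch computes the last block from the original values before the main loop touches them.

```c
#include <stddef.h>
#include <string.h>

enum { LANES = 8 };  /* assumed: 8 doubles per 512-bit vector */

/* a[i] *= b[i] for i in [0, n), with the last block overlapping earlier ones. */
void mul_inplace_overlap(double *a, const double *b, size_t n) {
    if (n < LANES) {
        /* Too short for even one full block: plain scalar loop. */
        for (size_t i = 0; i < n; ++i) a[i] *= b[i];
        return;
    }

    /* Products for the final LANES elements, taken from unmodified inputs. */
    size_t t = n - LANES;
    double tail[LANES];
    for (size_t j = 0; j < LANES; ++j) tail[j] = a[t + j] * b[t + j];

    /* Full blocks from the front; for n = 27 this covers a[0..23]. */
    for (size_t i = 0; i + LANES <= n; i += LANES)
        for (size_t j = 0; j < LANES; ++j) a[i + j] *= b[i + j];

    /* Write the final block; overlapped elements receive the same values. */
    memcpy(a + t, tail, sizeof tail);
}
```

For `n = 27` the final block starts at index 19, so elements 19 to 23 are written twice with identical results and elements 24 to 26 are written once, with no masking and no scalar remainder loop.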

0 Answers