AVX-512 offers most floating-point instructions in a masked form, where a mask selects which result elements are written and which are left unchanged or zeroed. Do CPUs actually use this information to decide which operations (say, multiplications) are performed, or does the mask merely control which results are overwritten, with the masked-out elements computed anyway and then discarded?
In practice this could be useful for processing a loop such as:
for (int i=0; i<27; i++) a[i] *= b[i];
27 isn't divisible by 8, so there will be some remaining items. One could write a separate loop that processes them one by one, or use AVX if at least 4 remain and then process the rest one by one. There are many possibilities.
A good compiler would vectorize this specific loop well automatically, but it's just a stand-in for more complex cases where I want to vectorize manually and still end up with unused items, or where I know in advance that some of the multiplication results are not needed.
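To make the masked alternative concrete, here is a minimal sketch of handling the remainder with a single masked iteration instead of a scalar loop. It assumes the elements are double (8 per 512-bit register) and a GCC- or Clang-compatible compiler on x86-64; the function names mul_avx512, mul_scalar and mul_arrays are made up for this illustration. It only shows how the mask keeps the last vector from touching elements past the end of the array; it says nothing about whether the hardware skips the work for masked-out elements.

```c
#include <stddef.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_AVX512_PATH 1

/* Hypothetical helper: a[i] *= b[i] for n doubles, using AVX-512F.
   The target attribute lets this compile without -mavx512f. */
__attribute__((target("avx512f")))
static void mul_avx512(double *a, const double *b, size_t n)
{
    size_t i = 0;

    /* Full vectors: 8 doubles per iteration, no mask needed. */
    for (; i + 8 <= n; i += 8) {
        __m512d va = _mm512_loadu_pd(a + i);
        __m512d vb = _mm512_loadu_pd(b + i);
        _mm512_storeu_pd(a + i, _mm512_mul_pd(va, vb));
    }

    /* Remainder (1..7 elements): one masked iteration. */
    size_t rem = n - i;
    if (rem != 0) {
        /* Lowest `rem` bits set, e.g. rem = 3 gives 0b00000111. */
        __mmask8 k = (__mmask8)((1u << rem) - 1u);

        /* Masked-out elements are not read from memory and load as zero. */
        __m512d va = _mm512_maskz_loadu_pd(k, a + i);
        __m512d vb = _mm512_maskz_loadu_pd(k, b + i);

        /* Zero-masked multiply: result elements outside the mask are zero. */
        __m512d prod = _mm512_maskz_mul_pd(k, va, vb);

        /* Only the elements selected by the mask are written back. */
        _mm512_mask_storeu_pd(a + i, k, prod);
    }
}
#endif

/* Plain scalar fallback, used when AVX-512F is not available. */
static void mul_scalar(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= b[i];
}

/* Hypothetical entry point: picks the AVX-512 path at run time if possible. */
void mul_arrays(double *a, const double *b, size_t n)
{
#ifdef HAVE_AVX512_PATH
    if (__builtin_cpu_supports("avx512f")) {
        mul_avx512(a, b, n);
        return;
    }
#endif
    mul_scalar(a, b, n);
}
```

With n = 27 this runs three full iterations and one masked iteration with rem = 3, so mul_arrays(a, b, 27) would replace the loop above. Whether that masked iteration is any cheaper than a full-width one is exactly what I'm asking.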
Edit: I checked experimentally, and it seems that masked operations are actually slower.