Let's say you call _mm512_mask_store_ps. From the point of view of the CPU's write buffer, is it executed as a single 64-byte store (with some sort of masking), or is it executed internally as multiple 4-byte stores?
The reason I ask: as I understand it, to avoid store-to-load forwarding stalls, the granularity (size) of a store has to match the granularity of subsequent loads to the same memory location. Hopefully the question makes sense; I'm no CPU architecture expert.
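
To make the scenario concrete, here is a minimal sketch of the access pattern I have in mind: one masked 64-byte store followed by a narrower 4-byte reload from the same buffer. The helper names (masked_store16, store_then_reload_lane) are made up for illustration, and I use the unaligned variant _mm512_mask_storeu_ps so the buffer does not need 64-byte alignment. When the code is not compiled with AVX-512F enabled it falls back to a scalar loop with the same architectural result. It only shows what ends up in memory; it says nothing about how the store buffer handles the store internally, which is exactly what I am asking.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#ifdef __AVX512F__
#include <immintrin.h>
#endif

/* Hypothetical helper: write lane i of src to dst only when bit i of k is set.
 * Lanes whose mask bit is clear are left untouched in memory. */
static void masked_store16(float *dst, uint16_t k, const float *src)
{
#ifdef __AVX512F__
    /* One masked 64-byte store instruction (unaligned variant). */
    _mm512_mask_storeu_ps(dst, (__mmask16)k, _mm512_loadu_ps(src));
#else
    /* Scalar fallback with the same architectural effect. */
    for (int i = 0; i < 16; ++i) {
        if ((k >> i) & 1u) {
            dst[i] = src[i];
        }
    }
#endif
}

/* Hypothetical helper: the pattern in question. A masked store of up to
 * 64 bytes is immediately followed by a 4-byte load of a single lane. */
static float store_then_reload_lane(float *buf, uint16_t k,
                                    const float *src, int lane)
{
    assert(lane >= 0 && lane < 16);
    masked_store16(buf, k, src);

    float out;
    memcpy(&out, &buf[lane], sizeof out); /* 4-byte reload */
    return out;
}
```

With a mask of 0x00FF, reloading lane 3 returns the freshly stored value, while reloading lane 12 returns whatever was in the buffer before, since that lane was masked off.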