What is the granularity of "masked" stores in AVX512?

Question

Lets say you call _mm512_mask_store_ps, from the point of view of the CPU's write buffer, is it executed as a store of size 64-bytes (with some sort of masking) or is it executed internally as multiple stores of size 4-bytes?

In order to prevent store-to-load forwarding stalls, one must match the granularity (size) of a store to the granularity of subsequent loads to the same memory location. Hopefully the question makes sense, I'm no CPU architecture expert.

See [§11.9 CONDITIONAL SIMD PACKED LOADS AND STORES](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf). — Iwillnotexist Idonotexist, Sep 03 '20 at 21:02

score 6 · Answer 1 · answered Sep 03 '20 at 22:03

As Iwillnotexist referenced:

If the mask is not all 1 or all 0, loads that depend on the masked store have to wait until the store data is written to the cache. If the mask is all 1 the data can be forwarded from the masked store to the dependent loads. If the mask is all 0 the loads do not depend on the masked store.

So there's no store-to-load forwarding for masked-stores, except for the case when the mask is all ones (behaves like a regular store), or all zeros (trivial). Load after a masked-store generally waits for data to be sent to cache, so it should be pretty expensive.

Using scatter-gather to do multiple hashmap-counter updates in parallel is sounding like more and more of a fool's errand, between this and the difficulties of efficiently handling collisions... — TLW, Nov 29 '22 at 02:49

What is the granularity of "masked" stores in AVX512?

1 Answers1