8

Lets say you call _mm512_mask_store_ps, from the point of view of the CPU's write buffer, is it executed as a store of size 64-bytes (with some sort of masking) or is it executed internally as multiple stores of size 4-bytes?

In order to prevent store-to-load forwarding stalls, one must match the granularity (size) of a store to the granularity of subsequent loads to the same memory location. Hopefully the question makes sense, I'm no CPU architecture expert.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
user2059893
  • 409
  • 3
  • 10
  • 4
    See [§11.9 CONDITIONAL SIMD PACKED LOADS AND STORES](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf). – Iwillnotexist Idonotexist Sep 03 '20 at 21:02

1 Answers1

6

As Iwillnotexist referenced:

If the mask is not all 1 or all 0, loads that depend on the masked store have to wait until the store data is written to the cache. If the mask is all 1 the data can be forwarded from the masked store to the dependent loads. If the mask is all 0 the loads do not depend on the masked store.

So there's no store-to-load forwarding for masked-stores, except for the case when the mask is all ones (behaves like a regular store), or all zeros (trivial). Load after a masked-store generally waits for data to be sent to cache, so it should be pretty expensive.

user2059893
  • 409
  • 3
  • 10
  • Using scatter-gather to do multiple hashmap-counter updates in parallel is sounding like more and more of a fool's errand, between this and the difficulties of efficiently handling collisions... – TLW Nov 29 '22 at 02:49