
I have a function that streams out structured data. The data are Vec4/Vec3/Vec2/float structures, so the maximum size is 16 bytes per structure. Now it may happen that the stream is read starting inside a structure. Simple solution: load the structure, build a store mask, and decrease the destination data pointer by however many bytes into the structure that call wants to start reading.

Imagine the current item type is Vec2 and we are 4 bytes into this structure:

xmm0 = 00000000-00000000-dadadada-dadadada
xmm1 = 00000000-00000000-ffffffff-00000000
result_data_ptr = 13450000
-> RDI = 1344fffc
maskmovdqu xmm0, xmm1

=> result is a page fault exception.

Is there any way to detect that this page fault will happen? The memory of the previous page won't even be touched ...
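For concreteness, a rough C-with-intrinsics equivalent of the pattern described above (illustrative only; store_partial_item and its parameters are made-up names, not the real code):

#include <emmintrin.h>   /* SSE2: _mm_maskmoveu_si128 == maskmovdqu */

static void store_partial_item(char *result_data_ptr, __m128i item,
                               unsigned bytes_inside, unsigned item_size)
{
    /* byte mask: top bit set => that byte of `item` gets stored */
    unsigned char mask_bytes[16] = {0};
    for (unsigned i = bytes_inside; i < item_size; ++i)
        mask_bytes[i] = 0x80;
    __m128i mask = _mm_loadu_si128((const __m128i *)mask_bytes);

    /* base address rewound by bytes_inside, exactly as described above;
       if those 16 bytes reach into an unmapped previous page, maskmovdqu
       may fault even though the bytes there are masked off */
    _mm_maskmoveu_si128(item, mask, result_data_ptr - bytes_inside);
}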

St0fF
    I would recommend avoiding `maskmovdqu` (it's weird and slow), but what that would entail depends on how you were using it. – harold Oct 11 '19 at 17:14
  • Well, I tried to describe how it was used. Maybe not clearly enough. I get a data pointer from the caller => that is `result_data_ptr`. The stream object calculates how many bytes into the current item it is, builds a store mask in xmm1, holds the item itself in xmm0, and sets `RDI := result_data_ptr - bytes_inside`. Now in case `result_data_ptr` was at a page boundary and the previous page didn't belong to my application's memory space, I get that page fault. – St0fF Oct 11 '19 at 17:27

2 Answers


maskmovdqu doesn't do fault-suppression, unlike AVX vmaskmovps or AVX512 masked stores. Those would solve your problem, although still maybe not in the most efficient way.

As documented in Intel's ISA reference manual, with an all-zero mask (so nothing is stored to memory): "Exceptions associated with addressing memory and page faults may still be signaled (implementation dependent)."

With a non-zero mask, I assume it's guaranteed to page-fault if the 16 bytes include any non-writeable pages. Or maybe some implementations do let the mask suppress faults even when some storing does happen (mask zeros in the unmapped page, but non-zero bytes elsewhere).
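If AVX is available, here is a hedged sketch of the fault-suppressing alternative (vmaskmovps via _mm_maskstore_ps; the function name and the assumption that the offset is a whole number of floats are mine, not from the question):

#include <immintrin.h>   /* AVX: _mm_maskstore_ps == vmaskmovps */

/* Store floats [floats_inside, item_floats) of `item`, starting at
   result_data_ptr - floats_inside. Masked-out dwords are not written and
   faults on them are suppressed, so an unmapped previous page is harmless. */
static void store_partial_item_avx(float *result_data_ptr, __m128 item,
                                   unsigned floats_inside, unsigned item_floats)
{
    __m128i mask = _mm_setr_epi32(
        (0u >= floats_inside && 0u < item_floats) ? -1 : 0,
        (1u >= floats_inside && 1u < item_floats) ? -1 : 0,
        (2u >= floats_inside && 2u < item_floats) ? -1 : 0,
        (3u >= floats_inside && 3u < item_floats) ? -1 : 0);
    _mm_maskstore_ps(result_data_ptr - floats_inside, mask, item);
}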


It's not a fast instruction anyway on real CPUs.

maskmovdqu might have been good sometimes on single-core Pentium 4 (or not IDK), and/or its MMX predecessor was maybe useful on in-order Pentium. Masked cache-bypassing stores are much less useful on modern CPUs where L3 is the normal backstop, and caches are large. Perhaps more importantly, there's more machinery between a single core and the memory controller(s) because everything has to work correctly even if another core did reload this memory at some point, so a partial-line write is maybe even less efficient.

It's generally a terrible choice if you really are only storing 8 or 12 bytes total. (Basically the same as an NT store that doesn't write a full line). Especially if you're using multiple narrow stores to grab pieces of data and put them into one contiguous stream. I would not assume that multiple overlapping maskmovdqu stores will result in a single efficient store of a whole cache line once you eventually finish one, even if the masks mean no byte is actually written twice.

L1d cache is excellent for buffering multiple small writes to a cache line before it's eventually done; use normal stores unless you can do a few NT stores nearly back-to-back.

To store the top 8 bytes of an XMM register, use movhps.
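For intrinsics users, a minimal sketch of that:

#include <xmmintrin.h>   /* SSE: _mm_storeh_pi == the movhps store form */

static void store_high_8_bytes(void *dst, __m128 v)
{
    _mm_storeh_pi((__m64 *)dst, v);   /* writes bits [127:64] of v to dst */
}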

Writing into cache also makes it fine to do overlapping stores, like movdqu. So you can concatenate a few 12-byte objects by shuffling them each to the bottom of an XMM register (or loading them that way in the first place), then use movdqu stores to [rdi], [rdi+12], [rdi+24], etc. The 4-byte overlap is totally fine; coalescing in the store buffer may absorb it before it even commits to L1d cache, or if not then L1d cache is still pretty fast.
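A hedged sketch of that overlapping-store idea, assuming the 12-byte items arrive in 16-byte source slots so a plain 16-byte load gets each one into the bottom of a register (names are illustrative):

#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>

/* Pack `count` 12-byte items (held in 16-byte source slots) contiguously.
   Each movdqu store writes 16 bytes; the 4-byte overlap is overwritten by the
   next store and absorbed by the store buffer / L1d. Note the final store
   writes 4 bytes past the packed output, so leave that much slack in dst. */
static void pack_12_byte_items(char *dst, const char *src, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + 16 * i));
        _mm_storeu_si128((__m128i *)(dst + 12 * i), v);
    }
}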


At the start of writing a large array, if you don't know the alignment you can do an unaligned movdqu of the first 16 bytes of your output. Then do the first 16-byte aligned store possibly overlapping with that. If your total output size is always >= 16 bytes, this strategy doesn't need a lot of branching to let you do aligned stores for most of it. At the end you can do the same thing with a final potentially-unaligned vector that might partially overlap the last aligned vector. (Or if the array is aligned, then there's no overlap and it's aligned too. movdqu is just as fast as movdqa if the address is aligned, on modern CPUs.)
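A hedged sketch of that strategy, assuming the total size is at least 16 bytes (the function and variable names are mine):

#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>
#include <stdint.h>

static void copy_with_aligned_body(char *dst, const char *src, size_t n)
{
    /* unaligned first vector */
    _mm_storeu_si128((__m128i *)dst, _mm_loadu_si128((const __m128i *)src));

    /* first 16-byte-aligned destination address after dst (at most dst+16) */
    char *adst = (char *)(((uintptr_t)dst + 16) & ~(uintptr_t)15);
    const char *asrc = src + (adst - dst);
    char *end = dst + n;

    /* aligned body; the first store may partially overlap the head store */
    while (adst + 16 <= end) {
        _mm_store_si128((__m128i *)adst, _mm_loadu_si128((const __m128i *)asrc));
        adst += 16; asrc += 16;
    }

    /* final potentially-unaligned vector, possibly overlapping the last body store */
    _mm_storeu_si128((__m128i *)(dst + n - 16),
                     _mm_loadu_si128((const __m128i *)(src + n - 16)));
}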

Peter Cordes
  • I did read those remarks in the manual; that's why I was asking in the first place. But judging from your answer, there is no proper way. I'll post a self-answer to show how to get around it in this particular setting. – St0fF Oct 11 '19 at 18:18
  • @St0fF: there are some "proper" ways, e.g. AVX `vmaskmovps`. Or SSE4.1 `extractps [rdi], xmm0, 2` will store just the 3rd element as a dword store, not a masked 16-byte store. Or with SSE1, `shufps` + `movss`. – Peter Cordes Oct 11 '19 at 18:21
  • I meant proper ways to detect maskmovdqu will produce a page fault... – St0fF Oct 11 '19 at 18:32
  • @St0fF: Oh. You can just detect page-crossing by checking if `p&0xfff >= (4096-15)`. i.e. check the page-offset bits of the address (that check is sketched in code right after these comments). You probably want to avoid a page-split anyway, even if the previous page is mapped. Or did you not mean determine whether `maskmovdqu` will do fault-suppression in general on one CPU vs. another? The way the manual is worded, that's not implied at all. With a non-zero mask, nothing is mentioned about suppressing faults so we shouldn't expect that they are. You could check the vol3 manual; the ISA ref manual entry doesn't always have everything. – Peter Cordes Oct 11 '19 at 18:54
  • @St0fF: but re: runtime detection to avoid it for page-splits. More likely it would be better to always avoid `maskmovdqu`, even when it wouldn't be a page-split. But page-splits in general are extra slow on Intel CPUs before Skylake, like 100 extra cycles. (At least page-split loads have a penalty like that; I forget if I've tested stores. They're not great even on Skylake, still more expensive than other cache-line-split stores. At least for regular cacheable stores. I haven't tested maskmov.) – Peter Cordes Oct 11 '19 at 19:00
  • I'm marking your answer as "the answer", as it simply contains so much very useful information. – St0fF Oct 12 '19 at 21:52
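Following up on the page-crossing check from the comments above, a minimal sketch assuming 4 KiB pages (the helper name is made up):

#include <stdbool.h>
#include <stdint.h>

/* true if a 16-byte store starting at p would span two 4 KiB pages,
   i.e. its page offset is greater than 4096 - 16 */
static bool store16_crosses_page(const void *p)
{
    return ((uintptr_t)p & 0xfff) >= (4096 - 15);
}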

Well, since it seems there's no good way to predict the page fault, I went the other way. This is a straight asm solution:

First, we use a table to shift the result according to bytes_inside. Then we find out how many bytes are to be written. As at most 15 bytes need to be written, this works as a 4-stage process. We simply test the bits of bytes_to_write: if the "8" bit (i.e. bit 3) is set, we use a movq; bit 2 requires a movd, bit 1 a pextrw and bit 0 a pextrb. After each store, the data pointer is incremented and the data register is shifted accordingly.

Registers:

  • r10: result_data_ptr
  • r11: bytes_inside
  • xmm0.word[6]: size of data item
  • xmm2: our data item
  • shuf_inside: data table for rotating an xmm register bytewise using pshufb (psrldq only allows immediate byte-shift counts)
.DATA
ALIGN 16
shuf_inside   byte 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0
              byte 2,3,4,5,6,7,8,9,10,11,12,13,14,15,0,1
              byte 3,4,5,6,7,8,9,10,11,12,13,14,15,0,1,2
              byte 4,5,6,7,8,9,10,11,12,13,14,15,0,1,2,3
              byte 5,6,7,8,9,10,11,12,13,14,15,0,1,2,3,4
              byte 6,7,8,9,10,11,12,13,14,15,0,1,2,3,4,5
              byte 7,8,9,10,11,12,13,14,15,0,1,2,3,4,5,6
              byte 8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7
              byte 9,10,11,12,13,14,15,0,1,2,3,4,5,6,7,8
              byte 10,11,12,13,14,15,0,1,2,3,4,5,6,7,8,9
              byte 11,12,13,14,15,0,1,2,3,4,5,6,7,8,9,10
              byte 12,13,14,15,0,1,2,3,4,5,6,7,8,9,10,11
              byte 13,14,15,0,1,2,3,4,5,6,7,8,9,10,11,12
              byte 14,15,0,1,2,3,4,5,6,7,8,9,10,11,12,13
              byte 15,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
.CODE
[...]
        lea             rax,        [ shuf_inside ]
        shl             r11,        4                       ;16 bytes per table row
        pshufb          xmm2,       [ rax + r11 - 16 ]      ;rotate item down by bytes_inside
        shr             r11,        4
        pextrw          rax,        xmm0,       6           ;reducedStrideWithPadding - i.e. size of item
        sub             rax,        r11                     ;bytes_to_write
        ;
        test            rax,        8                       ;bit 3 set -> store 8 bytes
        jz              lessThan8
        movq            qword ptr [r10], xmm2
        psrldq          xmm2,       8                       ;shift remaining bytes down
        add             r10,        8
        lessThan8:
        test            rax,        4                       ;bit 2 set -> store 4 bytes
        jz              lessThan4
        movd            dword ptr [r10], xmm2
        psrldq          xmm2,       4
        add             r10,        4
        lessThan4:
        test            rax,        2                       ;bit 1 set -> store 2 bytes
        jz              lessThan2
        pextrw          word ptr [r10], xmm2, 0
        psrldq          xmm2,       2
        add             r10,        2
        lessThan2:
        test            rax,        1                       ;bit 0 set -> store 1 byte
        jz              lessThan1
        pextrb          byte ptr [r10], xmm2, 0
        lessThan1:
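For comparison, a rough C-with-intrinsics rendering of the same tail-store idea (not the original code; it assumes the bytes to write have already been rotated to the bottom of the register):

#include <smmintrin.h>   /* SSE4.1 for _mm_extract_epi8 */
#include <stdint.h>

static void store_tail(char *dst, __m128i data, unsigned bytes_to_write)
{
    if (bytes_to_write & 8) {
        _mm_storel_epi64((__m128i *)dst, data);                  /* movq   */
        data = _mm_srli_si128(data, 8);                          /* psrldq */
        dst += 8;
    }
    if (bytes_to_write & 4) {
        *(uint32_t *)dst = (uint32_t)_mm_cvtsi128_si32(data);    /* movd (use memcpy if strict aliasing matters) */
        data = _mm_srli_si128(data, 4);
        dst += 4;
    }
    if (bytes_to_write & 2) {
        *(uint16_t *)dst = (uint16_t)_mm_extract_epi16(data, 0); /* pextrw */
        data = _mm_srli_si128(data, 2);
        dst += 2;
    }
    if (bytes_to_write & 1)
        *dst = (char)_mm_extract_epi8(data, 0);                  /* pextrb */
}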
St0fF
  • If you care about code-size, `test al, 8` instead of RAX. 2 bytes vs. 6 because there's no version of `test` with a sign-extended 8-bit immediate for 32 or 64-bit operand-size. – Peter Cordes Oct 11 '19 at 20:14
  • But really this looks pretty inefficient. I'd consider loading 16 bytes from the destination, then blend in your new bytes and store. Introduces a non-atomic RMW, but presumably no other thread is writing the same location at the same time. You can create a blend mask with an unaligned load that spans the boundary between `0xffff...` and `0x0000...`. See [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count?](//stackoverflow.com/q/34306933) for an example of the 31-byte sliding-window thing instead of 16x 16-byte vectors. – Peter Cordes Oct 11 '19 at 20:23
  • You're already using SSE4.1 `pextrb`, so you can use `pblendvb`. Doing a variable-count byte-shift is a pain, but you can build it out of `pshufb` like you're doing now, or maybe branch on shift > 8 or not and build it out of `psrlq` / `psllq` + shuffle and `por`. Probably a sliding window lookup for a `pshufb` control vector to emulate variable-count `psrldq` is good, shifting in zeros (high bit of control vec). You can use the same table twice: once to put the data in the right place in your vector, and again to create a blend mask from an all-ones vector (created with `pcmpeqd same,same`) – Peter Cordes Oct 11 '19 at 20:28
  • 1
    Thank you very much. All these hints sound really interesting. Up until now I have restrained myself to not use any AVX or Vxxxx commands, as at least one of my target systems still has Nehalem-architecture. Anyhow, I also like the idea of 2 masked loads, blending and then writing via movdqu. Sounds much better then above solution, and it removes the need for maskmovdqu. – St0fF Oct 12 '19 at 21:16
  • Also, I had another thought about my function: it's always called at the start of a much larger operation, thus it doesn't even matter if the following bytes get clobbered. Following operations will store the right data there, most likely while that cache line has not even been committed, i.e. shortly after. – St0fF Oct 12 '19 at 21:26
  • Yup, one of the paragraphs in my answer covered that. L1d cache works great as a write-combining buffer to absorb overlapping vector stores. (BTW, I usually use the term "commit" for store-buffer -> L1d cache because that's the point where it becomes globally visible. Cache is coherent. Unless you're talking about persistent memory like non-volatile DIMMs, in L1d = committed. The term you want is "before the cache line is *evicted* from L1d.) Or merging can even happen in the store buffer before commit to L1d, for back-to-back stores to the same line. – Peter Cordes Oct 12 '19 at 21:35
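A hedged sketch of the load/blend/store alternative discussed in these comments (SSE4.1; names are illustrative, and it assumes the item has already been shifted so the bytes to write sit at the bottom of the register, and that clobbering up to 15 bytes past the destination is acceptable, as noted above):

#include <smmintrin.h>   /* SSE4.1: _mm_blendv_epi8 == pblendvb */

/* Non-atomic read-modify-write: assumes no other thread writes these 16 bytes. */
static void blend_store(char *result_data_ptr, __m128i item_shifted,
                        __m128i byte_mask /* 0xFF in the low bytes-to-write lanes */)
{
    __m128i old    = _mm_loadu_si128((const __m128i *)result_data_ptr);
    __m128i merged = _mm_blendv_epi8(old, item_shifted, byte_mask); /* mask high bit picks the new byte */
    _mm_storeu_si128((__m128i *)result_data_ptr, merged);
}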