3

The WebSocket spec defines unmasking data as

j                   = i MOD 4
transformed-octet-i = original-octet-i XOR masking-key-octet-j

where the mask is 4 bytes long and the unmasking has to be applied per byte.
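
For reference, a straightforward per-byte implementation looks something like this (just a sketch; the function and variable names `unmask_bytewise`, `buff`, `len`, and `mask` are placeholders rather than anything from the spec):

#include <stddef.h>
#include <stdint.h>

/* Naive unmasking: XOR each payload byte with the matching mask byte. */
void unmask_bytewise(uint8_t *buff, size_t len, const uint8_t mask[4])
{
    for (size_t i = 0; i < len; i++)
        buff[i] ^= mask[i % 4];
}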

Is there a way to do this more efficiently than just looping over the bytes?

The server running the code can be assumed to have a Haswell CPU, and the OS is Linux with kernel > 3.2, so SSE etc. are all present. Coding is done in C, but I can do asm as well if necessary.

I tried to look up the solution myself, but couldn't figure out whether there is an appropriate instruction in any of the dozens of extensions (SSE1-5, AVX, ... I've lost track of them over the years).

Thank you very much!

Edit: After rereading the spec a couple of times, it seems it really is just XOR'ing the data bytes with the mask bytes, which I can do 8 bytes at a time until the last few bytes. The question is still open, though, as I think there could still be a way to optimize this using SSE or the like (maybe processing even 16 bytes at a time? letting the processor do the for loop? ...).
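
To make the edit concrete, the 8-bytes-at-a-time idea could look roughly like the sketch below (illustrative only; `unmask_u64` and the other names are made up, and the 4-byte mask is assumed to be replicated into a uint64_t):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: XOR 8 bytes per iteration, then finish the remainder per byte.
   Because 8 is a multiple of 4, the mask phase stays the same across chunks. */
void unmask_u64(uint8_t *buff, size_t len, const uint8_t mask[4])
{
    uint8_t rep[8] = { mask[0], mask[1], mask[2], mask[3],
                       mask[0], mask[1], mask[2], mask[3] };
    uint64_t m8;
    memcpy(&m8, rep, 8);                        // mask repeated to fill 64 bits

    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t chunk;
        memcpy(&chunk, &buff[i], 8);            // memcpy avoids alignment issues
        chunk ^= m8;
        memcpy(&buff[i], &chunk, 8);
    }
    for (; i < len; i++)                        // the last few bytes
        buff[i] ^= mask[i % 4];
}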


1 Answer

7

Yes, you can XOR 16 bytes in one instruction using SSE2, or 32 bytes at a time with AVX2 (Haswell and later).

SSE2:

#include <stdint.h>                        // uint8_t
#include <emmintrin.h>                     // SSE2 intrinsics

__m128i v, v_mask;                         // v_mask = 4-byte masking key repeated
                                           // 4 times, e.g. via _mm_set1_epi32
uint8_t *buff;                             // buffer - must be 16 byte aligned

for (int i = 0; i < N; i += 16)            // note that N must be a multiple of 16
{
    v = _mm_load_si128((__m128i *)&buff[i]);    // load 16 bytes
    v = _mm_xor_si128(v, v_mask);               // XOR with mask
    _mm_store_si128((__m128i *)&buff[i], v);    // store 16 masked bytes
}

AVX2:

#include <stdint.h>                        // uint8_t
#include <immintrin.h>                     // AVX2 intrinsics

__m256i w, w_mask;                         // w_mask = 4-byte masking key repeated
                                           // 8 times, e.g. via _mm256_set1_epi32
uint8_t *buff;                             // buffer - must be 16 byte aligned,
                                           // and preferably 32 byte aligned

for (int i = 0; i < N; i += 32)            // note that N must be a multiple of 32
{
    w = _mm256_load_si256((__m256i *)&buff[i]);   // load 32 bytes
    w = _mm256_xor_si256(w, w_mask);              // XOR with mask
    _mm256_store_si256((__m256i *)&buff[i], w);   // store 32 masked bytes
}
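
One thing the loops above leave out is how the mask vector gets filled and what to do with a trailing partial block. A possible way to handle both, shown here for the SSE2 case (the function name `unmask_sse2` and the use of the unaligned load/store variants are my own choices, not part of the answer):

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

/* Broadcast the 4-byte masking key, vectorize the bulk of the buffer,
   then unmask the trailing bytes one at a time. */
void unmask_sse2(uint8_t *buff, size_t len, const uint8_t mask[4])
{
    uint32_t key;
    memcpy(&key, mask, 4);
    __m128i v_mask = _mm_set1_epi32((int)key);            // mask repeated 4 times

    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((__m128i *)&buff[i]); // unaligned load
        v = _mm_xor_si128(v, v_mask);
        _mm_storeu_si128((__m128i *)&buff[i], v);
    }
    for (; i < len; i++)                                  // remainder
        buff[i] ^= mask[i % 4];
}

The unaligned variants sidestep the alignment requirement; on Haswell, unaligned loads that don't cross a cache line are generally cheap.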
Paul R
    Thank you very much! I'll accept this as answer as soon as I've tested it working! – griffin Jul 24 '13 at 10:40
  • Currently trying to put this into my code I found on google that pointers used should always be aligned to processing size - which means in case of AVX2 it should be 32 byte aligned - if I'm not wrong there, you might want to correct that in your code comments! – griffin Jul 24 '13 at 13:29
  • @griffin: it's slightly more complicated with AVX/AVX2 - SSE absolutely requires 16 byte alignment, whereas AVX/AVX2 will work OK with 16 byte alignment (even for 32 byte loads/stores), but may give better performance in some cases with 32 byte alignment. I've updated the comment. – Paul R Jul 24 '13 at 13:55
  • Better to just use a prologue/epilogue anyway and avoid bombing performance with unaligned loads on processors that don't deal well with it. Though if you have AVX, I suppose all processors support fast unaligned loads... (the ones I've encountered, at least). – Aktau Jun 23 '14 at 11:22