The WebSocket spec (RFC 6455) defines unmasking data as:
j = i MOD 4
transformed-octet-i = original-octet-i XOR masking-key-octet-j
where the mask is 4 bytes long and unmasking has to be applied per byte.
Is there a way to do this more efficiently than just looping over the bytes?
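For reference, the formula above maps onto a plain byte loop. This is only a minimal sketch of that baseline; the function name `unmask_bytewise` and its parameters are made up for illustration and do not come from the spec or any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-at-a-time unmasking, following the spec formula directly.
 * unmask_bytewise is a hypothetical name used only for this sketch.
 * The buffer is unmasked in place. */
static void unmask_bytewise(uint8_t *data, size_t len, const uint8_t mask[4])
{
    for (size_t i = 0; i < len; i++) {
        /* j = i MOD 4; transformed-octet-i = original-octet-i XOR mask[j] */
        data[i] ^= mask[i % 4];
    }
}
```

Since XOR is its own inverse, the same loop both masks and unmasks.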
The server running the code can be assumed to have a Haswell CPU, and the OS is Linux with kernel > 3.2, so SSE etc. are all present. Coding is done in C, but I can write asm as well if necessary.
I tried to look up the solution myself, but was unable to figure out whether there is an appropriate instruction in any of the dozens of SSE1-5/AVX/(whatever extension; I lost track of the many over the years).
Thank you very much!
Edit: After rereading the spec a couple of times, it seems that it is really just XORing the data bytes with the mask bytes, which I can do 8 bytes at a time until the last few bytes. The question is still open, as I think there is probably still a way to optimize this using SSE or the like (maybe processing even 16 bytes at a time? letting the processor do the for loop? ...)
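To make the 16-bytes-at-a-time idea concrete, here is a rough sketch using SSE2 intrinsics from `<emmintrin.h>`. Because 16 is a multiple of the 4-byte mask length, the mask can be repeated four times in one 128-bit register and XORed against each block; the leftover bytes fall back to the byte loop. The name `unmask_sse2` is hypothetical, and the sketch assumes an x86 target and a payload that starts at mask offset 0. Whether it actually beats what the compiler generates for the simple loop is something that would have to be measured.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* unmask_sse2 is a hypothetical name used only for this sketch.
 * Unmasks the buffer in place, 16 bytes per iteration. */
static void unmask_sse2(uint8_t *data, size_t len, const uint8_t mask[4])
{
    int32_t m32;
    memcpy(&m32, mask, sizeof m32);            /* keep mask bytes in memory order */
    const __m128i m128 = _mm_set1_epi32(m32);  /* mask repeated four times */

    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        /* Unaligned load/store, so the buffer needs no special alignment. */
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        v = _mm_xor_si128(v, m128);
        _mm_storeu_si128((__m128i *)(data + i), v);
    }

    /* Tail: i is a multiple of 16 here, so i % 4 continues the mask phase. */
    for (; i < len; i++) {
        data[i] ^= mask[i % 4];
    }
}
```

On a Haswell CPU the same pattern should extend to 32 bytes per iteration with the AVX2 equivalents (`_mm256_loadu_si256`, `_mm256_xor_si256`, `_mm256_storeu_si256` from `<immintrin.h>`, compiled with `-mavx2`).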