Assembly code for optimized bitshifting of a vector

Question

I'm trying to write a routine that logically shifts all elements of a vector right by n bit positions, as efficiently as possible, for the following vector types: BYTE->BYTE, WORD->WORD, DWORD->DWORD and WORD->BYTE (assuming that only 8 bits are present in the result). I would like three routines for each type, depending on the processor (SSE2 supported, only MMX supported, only the standard instruction set supported). Therefore I need 12 functions in total.

I have already worked out how to back up and restore the registers that I need, how to write a loop, how to copy data into regular registers or MMX registers, and how to logically shift by 1 position.

Because I'm not familiar with assembly language, that's about it. Which registers should I use for each instruction set? How can the availability of the large vector (an image) in the L1 cache be optimized? How do I find the next element of the vector (a pointer kind of thing)? I know I can do a mov by address, and I assume I have to increment the address by 1, 2 or 4 depending on my data type?

Although I have all the ideas, writing the code is a bit difficult at this point.

Thank you.

Arnaud.

Edit: Here is what I'm trying to do for MMX for a shift by 1 on a DWORD:

__asm("push mm"); // backup register
__asm("push cx"); // backup register
__asm("mov %cx, length"); // initialize loop
__asm("loopstart_shift1:"); // start label
__asm("movd %xmm0, r/m32"); // get 32 bits data
__asm("psrlq %xmm0, 1"); // right shift 32 bits data logically (stuffs 0 on the left) by 1
__asm("mov r/m32,%xmm0"); // set 32 bits data
__asm("dec %cx"); // decrement index
__asm("cmp %cx,0");
__asm("jnz loopstart_shift1");
__asm("pop cx"); // restore register
__asm("pop mm"); // restore register
__asm("emms"); // leave MMX state

I've answered this somewhere. Basically you rotate every array element and then use masking and xor-ing to copy bits from each element to the next. And of course you unroll the loops. — Mike Dunlavey, Jun 24 '11 at 13:30

score 1 · Answer 1 · answered Jun 24 '11 at 10:32

I strongly suggest you pause and take a look at using intrinsics with C or C++ instead of trying to write raw asm - that way the C/C++ compiler will take care of all the register allocation, instruction scheduling and general housekeeping tasks and you can just focus on the important parts, e.g. instead of using psrlq see _m_psrlq in mmintrin.h. (Better yet, look at using 128 bit SSE intrinsics.)
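
A minimal sketch of what the intrinsics approach could look like for the WORD->WORD case, assuming an SSE2-capable target and a compiler that provides `emmintrin.h`. The function name `shift_right_u16_sse2` and its parameters are hypothetical, chosen only for illustration. The unaligned load reads 8 WORDs into one 128-bit register at once, and `_mm_srl_epi16` (the `psrlw` instruction) shifts all of them with zero fill:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: logically shift every 16-bit element of data
   right by n bits, in place. */
static void shift_right_u16_sse2(uint16_t *data, size_t count, int n)
{
    assert(n >= 0 && n < 16);

    /* The shift count lives in the low 64 bits of an XMM register. */
    const __m128i shift = _mm_cvtsi32_si128(n);
    size_t i = 0;

    /* Main loop: 8 WORDs per iteration. Unaligned load/store is used
       so the pointer does not need 16-byte alignment. */
    for (; i + 8 <= count; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        v = _mm_srl_epi16(v, shift);  /* psrlw: zeros enter from the left */
        _mm_storeu_si128((__m128i *)(data + i), v);
    }

    /* Scalar tail for the remaining 0..7 elements. */
    for (; i < count; ++i)
        data[i] = (uint16_t)(data[i] >> n);
}
```

The pointer arithmetic `data + i` is scaled by the element size, so the compiler takes care of advancing the address by 2 bytes per WORD. The DWORD->DWORD variant would follow the same pattern with `_mm_srl_epi32` and 4 elements per iteration.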

Paul R

I don't think I can move 4 (8) 16-bit WORDs into one 64-bit (128-bit) MMX (SSE) register efficiently without assembly; intrinsics can replace the shifting, but the 64-bit (128-bit) word will only contain one pixel padded with zeros – Arnaud Jun 24 '11 at 10:58
@Arnaud: there are a host of ways to load packed registers, like `_mm_set_epi8` (http://msdn.microsoft.com/en-us/library/x0cx8zd3(v=VS.80).aspx); otherwise you just need to pack the data you start with in a better way. – Necrolis Jun 24 '11 at 13:06
@Arnaud: every MMX/SSE instruction is available through an intrinsic, so if you can do it in asm then you can do it with intrinsics (except that it will be a lot easier in the latter case) – Paul R Jun 24 '11 at 13:40
Will they have the exact same performance as writing MMX or SSE code directly? – Arnaud Jun 25 '11 at 09:19
@Arnaud: often intrinsics will give better performance, since you get the benefit of the compiler's register allocation, instruction scheduling, and back end optimisations. It also means that you can target different CPU families and get optimal code on each without rewriting any code - this is particularly significant for x86-64, where asm will normally need to be completely rewritten. – Paul R Jun 25 '11 at 09:49
Ok, so could someone give me a full example for SSE2 (or 4) and MMX for the above so that I can derive the others from it? – Arnaud Jun 25 '11 at 13:27
I think for the example above you just need one intrinsic: _mm_srli_epi16. – Paul R Jun 26 '11 at 05:19
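
A hedged sketch of how the two less obvious cases from the question might be handled with SSE2 intrinsics. The function names are hypothetical. For WORD->BYTE, the shifted WORDs are masked to their low 8 bits and narrowed with `_mm_packus_epi16`. For BYTE->BYTE, SSE2 has no 8-bit shift instruction, so the sketch shifts 16-bit lanes and then masks off the bits that crossed over from the neighbouring byte:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, WORD -> BYTE: dst[i] = low 8 bits of (src[i] >> n). */
static void shift_right_u16_to_u8_sse2(const uint16_t *src, uint8_t *dst,
                                       size_t count, int n)
{
    assert(n >= 0 && n < 16);
    const __m128i shift = _mm_cvtsi32_si128(n);
    const __m128i low8  = _mm_set1_epi16(0x00FF);
    size_t i = 0;

    /* 16 WORDs in, 16 BYTEs out per iteration. */
    for (; i + 16 <= count; i += 16) {
        __m128i lo = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i hi = _mm_loadu_si128((const __m128i *)(src + i + 8));
        /* Mask to 8 bits so the saturating pack below cannot alter values. */
        lo = _mm_and_si128(_mm_srl_epi16(lo, shift), low8);
        hi = _mm_and_si128(_mm_srl_epi16(hi, shift), low8);
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(lo, hi));
    }
    for (; i < count; ++i)              /* scalar tail */
        dst[i] = (uint8_t)(src[i] >> n);
}

/* Hypothetical helper, BYTE -> BYTE, in place. */
static void shift_right_u8_sse2(uint8_t *data, size_t count, int n)
{
    assert(n >= 0 && n < 8);
    const __m128i shift = _mm_cvtsi32_si128(n);
    /* Each byte keeps only its own (8 - n) low bits after the shift. */
    const __m128i mask  = _mm_set1_epi8((char)(0xFF >> n));
    size_t i = 0;

    for (; i + 16 <= count; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        /* The 16-bit shift leaks bits from the high byte into the low
           byte of each lane; the AND clears them. */
        v = _mm_and_si128(_mm_srl_epi16(v, shift), mask);
        _mm_storeu_si128((__m128i *)(data + i), v);
    }
    for (; i < count; ++i)              /* scalar tail */
        data[i] = (uint8_t)(data[i] >> n);
}
```

The scalar tail loops double as a plain-C reference for the "standard instruction set only" variant, and they can be used to check the vector paths on test data.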

score 0 · Answer 2 · answered Jun 24 '11 at 13:07

Sounds like you'd benefit from either using or looking into BitMagic's source. It's entirely intrinsics-based too, which makes it far more portable (though from the looks of it you're using GCC, so it might need an MSVC-to-GCC intrinsics mapping).

Necrolis

I have looked at this library, but it does not contain that many intrinsics, and the bitshifting functions are written in plain C and marked as todo for optimization. Useless at this point. – Arnaud Jun 25 '11 at 09:18
