
I'm trying to use aligned operations in SSE and I'm having an issue (surprise).

struct __declspec(align(16)) Vec4 {
    float x;
    float y;
    float z;
    float w;
};

Vec4 SSE_Add(const Vec4 &a, const Vec4 &b) {  
    __declspec(align(16)) Vec4 return_val;  

    _asm { 
        MOV EAX, a                    // Load pointers into CPU regs
        MOV EBX, b

        MOVAPS XMM0, [EAX]            // Move aligned vectors to SSE regs
        MOVAPS XMM1, [EBX]

        ADDPS XMM0, XMM1              // Add vector elements
        MOVAPS [return_val], XMM0     // Save the return vector
    }

    return return_val;
}

I get an access violation at `return return_val;`. Is this an alignment issue? How can I correct this?

Christian Rau
Lee Jacobs
  • Does it work correctly when you use the unaligned store? – 500 - Internal Server Error Jun 06 '13 at 18:46
  • *"Is this an alignment issue?"* - Only reason I could imagine would be that your compiler isn't able to properly align the stack, which nowadays shouldn't be the case. What is your compiler and the corresponding flags? – Christian Rau Jun 06 '13 at 18:55
  • @ChristianRau - I'm using Visual Studio 2010... so it should work just fine. – Lee Jacobs Jun 06 '13 at 18:57
  • @500-ISE - I'm not certain if it works with unaligned store. I'll check when I get home later tonight. – Lee Jacobs Jun 06 '13 at 18:58
  • As a side note, do your own sanity, your code's clarity, your program's portability *and* your application's performance a favor and use [intrinsics](http://msdn.microsoft.com/en-us/library/t467de55%28v=vs.100%29.aspx) for SSE operations instead of inline assembly and manual alignment attributes. – Christian Rau Jun 06 '13 at 19:03
  • @Christian Rau - Do intrinsics really improve performance? This is going to be part of a small attempt to profile vector operation speeds. I was planning to compare SSE inline assembly, standard C operations, and GLM operations on vectors. I do understand that intrinsics are more readable (though a lot of SSE assembly is pretty easy to understand), but I'd like to profile these operations from every perspective. – Lee Jacobs Jun 06 '13 at 19:09
  • @LeeJacobs If you're already profiling, then just include intrinsics in your tests and see what happens. But generally speaking, intrinsics are known to the compiler, and it is perfectly aware of what they do, so it has every possibility to reorder and optimize them, with inlining even across multiple functions (even I was surprised what beautiful code VS2010 made out of a bunch of SSE intrinsics wrapped into multiple functions). On the other hand, an assembly block is a complete black box for the compiler, with any optimization completely up to you. – Christian Rau Jun 06 '13 at 19:15
  • If you're really planning on actually using SSE in a reasonable vector abstraction library (as GLM does), then intrinsics are the only way to do it, and (together with proper inlining, of course) they bring perfect, clean abstraction at essentially no cost, whereas I guess your `SSE_Add` function will be much worse than a simple `float` loop (at least when paired with more such opaque functions). – Christian Rau Jun 06 '13 at 19:18
  • @LeeJacobs Related post, with a comparison of code generated by VC2010 from a simple instrinsics-based vector library: http://stackoverflow.com/a/10851231/743214. I'd wonder if that was achievable by inline assembly (except from actually hardcoding the whole block). – Christian Rau Jun 06 '13 at 19:26
  • @LeeJacobs Intrinsics generally improve performance because they let the compiler take care of coloring registers and scheduling instructions to avoid pipeline bubbles, which are mechanical tasks that machines can perform optimally. (Unlike many other tasks which smart humans do way better than dumb compilers.) You can learn more about that process here: http://www.csl.cornell.edu/courses/ece4750/handouts/ece4750-T03-pipelining-hazards-struct-data.pdf – Crashworks Jun 06 '13 at 19:33

2 Answers


I found out that the problem is with the EBX register. If you push/pop EBX, then it works. I'm not sure why, though, so if anyone can explain this, please do.

Edit: I've looked into the disassembly, and at the beginning of the function it stores the stack pointer in EBX:

mov ebx, esp

So you'd better make sure not to clobber it.
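
Following that advice, here is a sketch of the fixed function (MSVC-specific `__asm`, illustrative only). Note that EBX is restored *before* the store to `return_val`, since, per the disassembly above, the compiler may address aligned locals through EBX:

```cpp
Vec4 SSE_Add(const Vec4 &a, const Vec4 &b) {
    __declspec(align(16)) Vec4 return_val;

    __asm {
        PUSH EBX                      // EBX doubles as a frame pointer for
                                      // aligned locals here, so preserve it
        MOV EAX, a
        MOV EBX, b
        MOVAPS XMM0, [EAX]
        MOVAPS XMM1, [EBX]
        ADDPS XMM0, XMM1
        POP EBX                       // restore EBX first: return_val may be
                                      // addressed relative to it
        MOVAPS [return_val], XMM0
    }

    return return_val;
}
```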

catscradle
  • Sounds like the answer; messing up the stack pointer is enough to create access violation madness at the return statement. (Another reason to keep your hands off inline assembly.) – Christian Rau Jun 06 '13 at 19:39
  • This is one strong reason to prefer intrinsics to inline assembly: it avoids inadvertently clobbering registers that the compiler uses for other things. Conversely, if you intend to use assembly, you should learn the platform's ABI and all its calling conventions by heart. – Crashworks Jun 06 '13 at 19:42
  • Saving the stack pointer is part of the function prologue, and it is done to make a stack frame. It makes the function's activation record accessible at a fixed address (in this case ebx+something). However, it is usually the EBP register that has this role... So... this is a bit compiler-specific. – migle Sep 29 '16 at 12:48

This is a bit compiler-dependent... Isn't the correct thing to write `movaps return_val, xmm0`?

Why don't you show us the generated code?

The way you are writing this is a lot worse than if you let the compiler do it on its own.

  • This function should be inlinable and, in the best case, translate to a single instruction; if you write it like this, it cannot be inlined.
  • This function could receive its arguments in registers on Intel 64 and return its result in a register; written like this, it is forced to use the stack.
  • This function could use return value optimization; written like this, xmm0 must be written to the return_val variable, which then has to be copied a second time.

So... aligned versus unaligned moves (MOVAPS versus MOVUPS) are your least concern.

Why not just, in portable code, write:

inline void add(const float *__restrict__ a, const float *__restrict__ b, float *__restrict__ r)
{
    for (int i = 0; i != 4; ++i) r[i] = a[i] + b[i];
}
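
For comparison, a minimal intrinsics-based sketch of the same operation (using `xmmintrin.h`; the pointers are assumed 16-byte aligned, as `_mm_load_ps`/`_mm_store_ps` require):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Sketch: the same 4-float add via SSE intrinsics. Unlike an __asm block,
// the compiler can inline this and do register allocation and scheduling
// itself.
inline void add_sse(const float *a, const float *b, float *r)
{
    __m128 va = _mm_load_ps(a);               // aligned load of 4 floats
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(r, _mm_add_ps(va, vb));      // packed add, aligned store
}
```

Once inlined, the loads and stores typically disappear and the whole thing can reduce to a single `addps`.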
migle
  • Or, in equally portable, but actually optimized for SSE, code: `_mm_store_ps(r, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)))`. I would wonder if the compiler would really turn that loop into an SSE instruction (well, maybe *Intel*'s). – Christian Rau Jun 06 '13 at 19:20
  • I wouldn't call this code "portable" when it uses `__restrict__`. – Ben Voigt Jun 06 '13 at 19:24
  • @Crashworks: The version with intrinsics performs just fine without any modifiers; because it copies the prior value into an MMX/SSE register before making any writes, it doesn't have to worry about what happens if source and destination overlap. – Ben Voigt Jun 06 '13 at 20:49
  • @BenVoigt Intrinsics aren't portable at all -- they are specific to an instruction set. – Crashworks Jun 06 '13 at 21:06
  • @Crashworks: And may or may not have the same name in compilers targeting that instruction set but made by different vendors. So yeah, not portable. – Ben Voigt Jun 06 '13 at 21:10
  • @BenVoigt `restrict` is a keyword in C. In C++, they still haven't worked it out, they are looking for a deeper, more general way to avoid undesired dependences. In the meanwhile, C/C++ compilers usually allow `__restrict__` in C++ mode and use it just as in C. – migle Sep 29 '16 at 12:53
  • @ChristianRau compilers do vectorize this. Even VC++! The reason why they usually don't is that without `restrict` they must assume aliasing and unknown dependencies, and therefore cannot reorder the statements. Using intrinsics forces the compiler to use those specific instructions, even though optimizations could be possible, such as eliminating temporaries, etc. – migle Sep 29 '16 at 12:59