
I am interested in using the clang vector extensions, such as:

typedef float vec3 __attribute__((ext_vector_type(3)));

I have 2 questions:

  • As you can see in the examples above and below, I am primarily interested in using them to manipulate vec3 vectors (xyz). My understanding is that a SIMD register is typically 128 bits wide, and a vec3 would only occupy 96 of those bits. So I am wondering whether there is a penalty for not filling the full 128 bits, or whether, if I used vec2 instead, the compiler could pack two of them into one register. Should I use vec4 instead, even though I won't be using the fourth element in most cases? Is that better from an alignment/performance standpoint? (A quick sizeof/alignof sanity check is sketched after the code below.)

  • I would eventually like to "measure" how much more efficient these extensions are compared with plain structs. Besides running the code in a loop a great many times and timing it, I don't know of any other way, and that seems rather naïve. It isn't even very informative for the small example below: when I compile with -O3, the code runs extremely fast either way. Can I also tell that the vector version is better optimized by looking at the generated assembly? (I tried, but even though the source is short, the generated assembly is already quite long and, beyond the basics, rather overwhelming.) Suggestions would be greatly appreciated; my goal is essentially to prove to myself that using these extensions produces an executable that runs faster.

#include <cstdio>  // printf (commented out below)
#include <cstdlib> // EXIT_SUCCESS

typedef float vec3 __attribute__((ext_vector_type(3)));
struct vec3f { float x, y, z; };

int main(int argc, char **argv)
{
    for (unsigned long i = 0; i < 1e12; ++i) {
        for (unsigned long j = 0; j < 1e12; ++j) {
#if 1
            vec3 a = {1, 0, 0};
            vec3 b = {0, 1, 0};

            vec3 lhs = a.yzx * b.zxy;
            vec3 rhs = a.zxy * b.yzx;

            vec3 c = lhs - rhs;
#else
            vec3f a = {1, 0, 0};
            vec3f b = {0, 1, 0};
            vec3f c;
            c.x = a.y * b.z - a.z * b.y;
            c.y = a.z * b.x - a.x * b.z;
            c.z = a.x * b.y - a.y * b.x;
#endif
            //printf("%f %f %f\n", c.x, c.y, c.z);
        }
    }
    return EXIT_SUCCESS;
}
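
For reference, here is a quick sanity-check sketch for the two points above: it prints the sizes/alignments the compiler actually uses, and it times a loop arranged so that -O3 cannot simply hoist or delete the work. The vec4 typedef, the iteration count, and the expected sizes in the comments are only my assumptions; the printed values are what counts.

#include <chrono>
#include <cstdio>

typedef float vec3 __attribute__((ext_vector_type(3)));
typedef float vec4 __attribute__((ext_vector_type(4)));
struct vec3f { float x, y, z; };

int main()
{
    // On clang I would expect vec3 to be padded to 16 bytes with 16-byte
    // alignment (same as vec4), while the plain struct stays at 12 bytes.
    printf("vec3 : size %zu align %zu\n", sizeof(vec3),  alignof(vec3));
    printf("vec4 : size %zu align %zu\n", sizeof(vec4),  alignof(vec4));
    printf("vec3f: size %zu align %zu\n", sizeof(vec3f), alignof(vec3f));

    // Make the inputs depend on the loop counter and accumulate the result,
    // so the optimizer cannot hoist the cross product or remove the loop.
    auto t0 = std::chrono::steady_clock::now();
    vec3 acc = {0, 0, 0};
    for (unsigned long i = 0; i < 100000000UL; ++i) {
        vec3 a = {(float)i, 1, 0};
        vec3 b = {0, 1, (float)i};
        acc += a.yzx * b.zxy - a.zxy * b.yzx;
    }
    auto t1 = std::chrono::steady_clock::now();
    long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    printf("%f %f %f (%lld ms)\n", acc.x, acc.y, acc.z, ms);
    return 0;
}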
user18490
    Look at the generated asm for some ISA you care about, e.g. x86-64, on https://godbolt.org/ with clang `-O3 -march=haswell` or whatever. I'd assume `vec3` compiles rather poorly, especially for stores to avoid writing the high 4 bytes of a 16-byte register when storing the rest. But also loads if the compiler doesn't know it can safely read 4 bytes past the end. – Peter Cordes Jul 04 '22 at 10:11
    If you can, use vectors of `{x0, x1, x2, x3}`, `{y0, y1, y2, y3}`, etc. so you can do things other than add without shuffling. (e.g. get the magnitude of 4 vectors in parallel.) See https://stackoverflow.com/tags/sse/info especially https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/ where they explain how using 1 SIMD vector as 1 geometry vectors is slow. – Peter Cordes Jul 04 '22 at 10:12
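
A minimal sketch of the component-wise ("structure of arrays") layout described in the comment above, processing four geometry vectors per SIMD register; the type and function names here are only illustrative:

typedef float vec4 __attribute__((ext_vector_type(4)));

// Four xyz vectors stored component-wise: one SIMD register per component.
struct Vec3x4 { vec4 x, y, z; };

// Four cross products at once, c[i] = a[i] x b[i], with no shuffles needed.
static inline Vec3x4 cross4(const Vec3x4 &a, const Vec3x4 &b)
{
    Vec3x4 c;
    c.x = a.y * b.z - a.z * b.y;
    c.y = a.z * b.x - a.x * b.z;
    c.z = a.x * b.y - a.y * b.x;
    return c;
}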

0 Answers