How to cast to unsigned vector type after using __builtin_msa_ld_*

Question

I'm evaluating MIPS SIMD Architecture (MSA) programming using the Codescape GCC Toolchain. There's not much information out there about MSA and builtins. (As far as I can tell there's only two MSA cpu's, the P5600 and Warrior I6400, and they first became available several years ago).

My test program is below.

#include <msa.h>
#include <stdint.h>

#define ALIGN16 __attribute__((aligned(16)))

int main(int argc, char* argv[])
{
    ALIGN16 uint32_t a[] = {64, 128, 256, 512};
    ALIGN16 uint32_t b[] = {1024, 2048, 4096, 8192};
    ALIGN16 uint32_t c[4];

    v4u32 va = __builtin_msa_ld_w (a, 0);
    v4u32 vb = __builtin_msa_ld_w (b, 0);

    v4u32 vc = __builtin_msa_adds_u_w (va, vb);
    __builtin_msa_st_w (vc, c, 0);

    return 0;
}

Compiling the program results in the errors shown below. The problem is, the vector loads return a signed vector but my vectors are unsigned. I have a similar problem with the vector stores.

// The 4 vector loads provided through builtins
v16i8 __builtin_msa_ld_b (void *, imm_n512_511);    // byte
v8i16 __builtin_msa_ld_h (void *, imm_n1024_1022);  // half word
v4i32 __builtin_msa_ld_w (void *, imm_n2048_2044);  // word
v2i64 __builtin_msa_ld_d (void *, imm_n4096_4088);  // double word

(The imm_n512_511 and friends is discussed in the GCC manual at 6.59.16 MIPS SIMD Architecture (MSA) Support).

I read MIPS paper(?) at MIPS SIMD Architecture but it does not discuss how to convert between integral vector types. There are lots of floating-point conversion instructions, but nothing for integral types.

Is a simple cast the preferred way to convert between integral vector types? Or is there something else I should be doing?

MSA$ mips-img-linux-gnu-gcc.exe -mmsa test.c -c
test.c: In function 'main':
test.c:12:2: note: use -flax-vector-conversions to permit conversions between ve
ctors with differing element types or numbers of subparts
  v4u32 va = __builtin_msa_ld_w (a, 0);
  ^~~~~
test.c:12:13: error: incompatible types when initializing type 'v4u32 {aka __vec
tor(4) unsigned int}' using type '__vector(4) int'
  v4u32 va = __builtin_msa_ld_w (a, 0);
             ^~~~~~~~~~~~~~~~~~
test.c:13:13: error: incompatible types when initializing type 'v4u32 {aka __vec
tor(4) unsigned int}' using type '__vector(4) int'
  v4u32 vb = __builtin_msa_ld_w (b, 0);
             ^~~~~~~~~~~~~~~~~~
test.c:16:22: error: incompatible type for argument 1 of '__builtin_msa_st_w'
  __builtin_msa_st_w (vc, c, 0);
                      ^~
test.c:16:22: note: expected '__vector(4) int' but argument is of type 'v4u32 {a
ka __vector(4) unsigned int}'

why don't you use `va` and `vb` instead of `a` and `b` ? Also doc say, "The load/store instructions do not require 128-bit (16-byte) memory address alignment.", so I don't think you need `ALIGN16`. I don't think you need to worry "The MSA complements the well-established MIPS architecture with a set of more than 150 new instructions operating on 32 vector registers of 8-, 16-, 32-, and 64-bit integer, 16-and 32-bit fixed- point, or 32- and 64-bit floating-point data elements....", look like a simple cast will do the job correctly (if of course the real integer are unsigned). — Stargateur, Oct 21 '18 at 06:20
Thanks @Stargateur - The `ALIGN16` came from [MIPS SIMD Architecture](https://www.mips.com/downloads/mips-simd-architecture/), Section 5.1 Vector Data Types and Intrinsics, page 10: *"It is recommended aligning the vector data to the size of the vector registers"*. I think the importance of *"size of the vector registers"* is, DSP has 64-bit vector registers, while MSA has 128-bit vector registers. — jww, Oct 21 '18 at 06:25
Well, recommanded is different of required, maybe it's better to align them, I didn't read all documentation. Also I think they didn't add specific instruction for loading unsigned to save some instruction, because I suppose their load work on both sign. Maybe, you should add yourself the wrapper function that will cast the vector for you when you load and store unsigned integer. — Stargateur, Oct 21 '18 at 06:28

score 2 · Accepted Answer · answered Oct 21 '18 at 08:16

Either you use casts and -flax-vector-conversions, or use an union type to represent the vector registers and explicitly work on that union type. GCC explicitly supports that form of type-punning.

For example, you could declare an msa128 type,

typedef union __attribute__ ((aligned (16))) {
    v2u64   u64;
    v2i64   i64;
    v2f64   f64;
    v4u32   u32;
    v4i32   i32;
    v4f32   f32;
    v8u16   u16;
    v8i16   i16;
    v16u8   u8;
    v16i8   i8;
} msa128;

and then have your code work explicitly on the msa128 type. Your example program could be written as

    uint32_t a[4] = { 64, 128, 256, 512 };
    uint32_t b[4] = { 1024, 2048, 4096, 8192 };
    uint32_t c[4];
    msa128   va, vb, vc;

    va.i32 = __builtin_msa_ld_w(a, 0);
    vb.i32 = __builtin_msa_ld_w(b, 0);
    vc.u32 = __builtin_msa_adds_u_w(va.u32, vb.u32);
    __builtin_msa_st_w(vc.i32, c, 0);

Obviously, it becomes quite annoying to remember the exact type one needs to use, so some static inline helper functions would definitely be handy:

static inline msa128  msa128_load64(const void *from, const int imm)
{ return (msa128){ .i64 = __builtin_msa_ld_d(from, imm); } }

static inline msa128  msa128_load32(const void *from, const int imm)
{ return (msa128){ .i32 = __builtin_msa_ld_w(from, imm); } }

static inline msa128  msa128_load16(const void *from, const int imm)
{ return (msa128){ .i16 = __builtin_msa_ld_h(from, imm); } }

static inline msa128  msa128_load8(const void *from, const int imm)
{ return (msa128){ .i8  = __builtin_msa_ld_b(from, imm); } }

static inline void  msa128_store64(const msa128 val, void *to, const int imm)
{ __builtin_msa_st_d(val.i64, to, imm); }

static inline void  msa128_store32(const msa128 val, void *to, const int imm)
{ __builtin_msa_st_w(val.i32, to, imm); }

static inline void  msa128_store16(const msa128 val, void *to, const int imm)
{ __builtin_msa_st_h(val.i16, to, imm); }

static inline void  msa128_store8(const msa128 val, void *to, const int imm)
{ __builtin_msa_st_b(val.i8, to, imm); }

For example, the binary AND, OR, NOR, and XOR operations are

static inline msa128  msa128_and(const msa128 a, const msa128 b)
{ return (msa128){ .u8 = __builtin_msa_and_v(a, b) }; }

static inline msa128  msa128_or(const msa128 a, const msa128 b)
{ return (msa128){ .u8 = __builtin_msa_or_v(a, b) }; }

static inline msa128  msa128_nor(const msa128 a, const msa128 b)
{ return (msa128){ .u8 = __builtin_msa_nor_v(a, b) }; }

static inline msa128  msa128_xor(const msa128 a, const msa128 b)
{ return (msa128){ .u8 = __builtin_msa_xor_v(a, b) }; }

It probably wouldn't hurt creating some macros to represent the vectors in array form:

#define  MSA128_U64(...)  ((msa128){ .u64 = { __VA_ARGS__ }})
#define  MSA128_I64(...)  ((msa128){ .i64 = { __VA_ARGS__ }})
#define  MSA128_F64(...)  ((msa128){ .f64 = { __VA_ARGS__ }})
#define  MSA128_U32(...)  ((msa128){ .u32 = { __VA_ARGS__ }})
#define  MSA128_I32(...)  ((msa128){ .i32 = { __VA_ARGS__ }})
#define  MSA128_F32(...)  ((msa128){ .f32 = { __VA_ARGS__ }})
#define  MSA128_U16(...)  ((msa128){ .u16 = { __VA_ARGS__ }})
#define  MSA128_I16(...)  ((msa128){ .i16 = { __VA_ARGS__ }})
#define  MSA128_U8(...)   ((msa128){ .u8  = { __VA_ARGS__ }})
#define  MSA128_I8(...)   ((msa128){ .i8  = { __VA_ARGS__ }})

The reason I suggest this GCC-specific approach is that the builtins are GCC specific anyway. Other than the union type, it is very close to how GCC implements Intel/AMD vector intrinsics in <immintrin.h>.

Thanks @Nominal. Eventually I need to use this in C++, so I can't use the union trick. (Sorry abut that. I refrained from adding C++ tag because of the folks who comment C and C++ are different languages...). — jww, Oct 21 '18 at 13:09
@jww your question is the typical case where tag both C and C++ will be wrong, if you need answer for both language, you must create two question and a C++ expert or C will decide if the answer is correct for both C and C++, and will juste close as duplicate. For example, I could answer (and I did but in comment) the C part but I could not be sure for the C++ part as I never read the C++ standard. Unless your question is asking for compatibility issue (a specific question), don't tag both language. — Stargateur, Oct 21 '18 at 13:54
Thanks @nominal. Yes, the union trick can be undefined behavior in C++ *if* you access the inactive member (which is relied upon here). Also see [Accessing inactive union member and undefined behavior?](https://stackoverflow.com/q/11373203/608639) I found I can squash the warnings in a manner that keeps C and C++ happy by taking the address of the vector variable and using a `memcpy`, but it is a code wart. It is too bad MIPS does not have casts like ARM intrinsics (like `vreinterpretq_i32_u32`). — jww, Oct 21 '18 at 14:42
Thanks again @nominal. If a MIPS expert comes along and offers something better I may need to move the accept. — jww, Oct 21 '18 at 18:26
@jww: No worries! To me, only the applicability/usefulness of the answer matters. (I admit, I am one of those to whom C and C++ differences matter -- but in my defense, it is because the answers are (or should be) completely different, to be efficient. This *is* one of those cases.) As I use C for my low-level code, and haven't used the vector intrinsics in C++, I'm not sure what the best solution in C++ is, but I'll take a look later today, and edit my answer if I find anything better than the memcpy() way. — Nominal Animal, Oct 22 '18 at 02:48

score 0 · Answer 2 · answered Oct 21 '18 at 18:25

Here is an alternative that works with both C and C++. It performs a memcpy on the register variables. The inline functions borrow from ARM NEON support. ARM provides casts for the NEON vectors, like vreinterpretq_u64_u8. The inline on the functions requires C99.

#include <msa.h>
#include <stdint.h>
#include <string.h>

inline v4i32 reinterpretq_i32_u32(const v4u32 val) {
    v4i32 res;
    memcpy(&res, &val, sizeof(res));
    return res;
}

inline v4u32 reinterpretq_u32_i32(const v4i32 val) {
    v4u32 res;
    memcpy(&res, &val, sizeof(res));
    return res;
}

#define ALIGN16 __attribute__((aligned(16)))

int main(int argc, char* argv[])
{
    ALIGN16 uint32_t a[] = {64, 128, 256, 512};
    ALIGN16 uint32_t b[] = {1024, 2048, 4096, 8192};
    ALIGN16 uint32_t c[4];

    v4u32 va = reinterpretq_u32_i32(__builtin_msa_ld_w (a, 0));
    v4u32 vb = reinterpretq_u32_i32(__builtin_msa_ld_w (b, 0));

    v4u32 vc = __builtin_msa_adds_u_w (va, vb);
    __builtin_msa_st_w (reinterpretq_i32_u32(vc), c, 0);

    return 0;
}

And a compile at -O3 (it is clean at -Wall -Wextra):

MSA$ mips-img-linux-gnu-gcc.exe -O3 -mmsa test.c -c
MSA$

And a disassembly looks like it passes the sniff test:

MSA$ mips-img-linux-gnu-objdump.exe --disassemble test.o

test.o:     file format elf32-tradbigmips

Disassembly of section .text:

00000000 <main>:
   0:   27bdffc8        addiu      sp,sp,-56
   4:   3c020000        lui        v0,0x0
   8:   24420000        addiu      v0,v0,0
   c:   78001062        ld.w       $w1,0(v0)
  10:   3c020000        lui        v0,0x0
  14:   24420000        addiu      v0,v0,0
  18:   78001022        ld.w       $w0,0(v0)
  1c:   79c10010        adds_u.w   $w0,$w0,$w1
  20:   7802e826        st.w       $w0,8(sp)
  24:   93a2000b        lbu        v0,11(sp)
  28:   03e00009        jr         ra
  2c:   27bd0038        addiu      sp,sp,56

For completeness, GCC 6.3.0:

MSA$ mips-img-linux-gnu-gcc.exe --version
mips-img-linux-gnu-gcc.exe (Codescape GNU Tools 2017.10-05 for MIPS IMG Linux) 6.3.0
Copyright (C) 2016 Free Software Foundation, Inc.

I wonder if using `typedef char msa128 __attribute__((vector_size (16), aligned (16)));` as the basic vector type suffices. Since it is a character type, GCC allows it to alias any other type, so you can cast between e.g. `v2u64 a;` and `v4f32 b;` using `a = (v2u64)((msa128)(b));` and `b = (v4f32)((msa128)(a));` (in both g++ (`-std=gnu++14`) and gcc (`-std=gnu11`) using GCC 5.4.0). — Nominal Animal, Oct 22 '18 at 10:38

How to cast to unsigned vector type after using __builtin_msa_ld_*

2 Answers2