What is the right way to access the __builtin_bswap functions?

Question

I have an application that uses a database with data stored in big-endian order. To access this data portably across hardware platforms, I use 4 macros defined in a config.h module:

word(p) - gets a big-endian 16 bit value at pointer p as a native 16-bit value.
putword(p, w) - stores a native 16-bit variable (w) to pointer p as 16-bit big-endian.
dword(p) and putdword(p, d) do the same for 32-bit values

This all works fine, but the macros on a little-endian machine use the brute-force 'shift and mask' approach.

It looks like there are __builtin_bswap16 and __builtin_bswap32 functions on Linux that may do this more efficiently (as inline assembler code?). So what's the right way to code my word/putword macros so that they use these builtin functions on an x86_64 Linux machine? Would coding my macros as htons/htonl function calls do the same thing as efficiently, and is it necessary to enable compiler optimization to get any of these solutions to work? I'd rather not optimize if it renders gdb useless.

See [endian(3)](https://man7.org/linux/man-pages/man3/endian.3.html), [glibc endian.h](https://github.com/lattera/glibc/blob/master/string/endian.h), [newlib endian.h](https://github.com/eblot/newlib/blob/master/newlib/libc/iconv/lib/endian.h) and [gpsd/bits.h](https://github.com/biiont/gpsd/blob/master/bits.h#L45). Regarding `stores a native 16-bit variable (w) to pointer p as 16-bit big-endian.`: is `p` guaranteed to be aligned to 16 bits? Is `w` guaranteed to be aligned to 16 bits? — KamilCuk, Apr 04 '21 at 22:28
Yeah, the 16- and 32-bit alignments are guaranteed. This database originated on an IBM Series/1, which is big-endian and does not support unaligned 16- and 32-bit data fetches/stores. In fact, the database is part of a system in which some small amount of legacy Series/1 assembler code is still run via a software emulation layer. That emulation layer is likely to reap way more benefit from fast byte-swapping than newer code that just has to byte-swap database fields... — littlenoodles, Apr 05 '21 at 17:27
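
Given that alignment is guaranteed, the `endian(3)` functions linked in the comment above are one option. The following is a minimal sketch, assuming glibc's `<endian.h>` (these functions are nonstandard, and other C libraries may put them in a different header). The inline functions reuse the question's macro names purely for illustration:

```c
#define _DEFAULT_SOURCE   /* assumption: glibc; exposes the endian(3) functions */
#include <endian.h>
#include <stdint.h>

/* Hypothetical inline replacements for word/putword/dword/putdword.
   Each assumes p points at a properly aligned 16- or 32-bit object. */
static inline uint16_t word(const uint16_t *p)
{
    return be16toh(*p);          /* big-endian storage -> native value */
}

static inline void putword(uint16_t *p, uint16_t w)
{
    *p = htobe16(w);             /* native value -> big-endian storage */
}

static inline uint32_t dword(const uint32_t *p)
{
    return be32toh(*p);
}

static inline void putdword(uint32_t *p, uint32_t d)
{
    *p = htobe32(d);
}
```

On a big-endian host these conversions are no-ops, so the same source should serve both kinds of machine.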

score 0 · Answer 1 · answered Apr 04 '21 at 22:15

Hmmm. I wrote a trivial test program using no special include files and simply calling the __builtin_bswap... functions directly (see the 'fast...' macros below). It all just works. When I disassemble the code in gdb, I see that the fast... macros do in 4-5 assembler instructions what takes up to 27 instructions for the worst case 'dword' macro. Pretty neat improvement for almost no effort.

#include <stdio.h>   /* printf */

typedef unsigned char uchar;
typedef unsigned short ushort;
typedef unsigned int uint;

#define word(a)       (ushort) ( (*((uchar *)(a)) << 8) |          \
                                 (*((uchar *)(a) + 1)) )
#define putword(a,w)  *((char *)(a))   =  (char) (((ushort)((w) >>  8)) & 0x00ff), \
                      *((char *)(a)+1) =  (char) (((ushort)((w) >>  0)) & 0x00ff)
#define dword(a) (uint)  ( ((uint)(word(a)) << 16) |      \
                             ((uint)(word(((uchar *)(a) + 2)))) )
#define putdword(a,d) *((char *)(a))   =  (char) (((uint)((d) >> 24)) & 0x00ff), \
                      *((char *)(a)+1) =  (char) (((uint)((d) >> 16)) & 0x00ff), \
                      *((char *)(a)+2) =  (char) (((uint)((d) >>  8)) & 0x00ff), \
                      *((char *)(a)+3) =  (char) (((uint)((d) >>  0)) & 0x00ff)

/* Arguments are parenthesized and there is no trailing semicolon, so the macros behave like expressions. */
#define fastword(a)        ((ushort) __builtin_bswap16(*((ushort *)(a))))
#define fastputword(a, w)  (*((ushort *)(a)) = __builtin_bswap16((ushort)(w)))
#define fastdword(a)       ((uint) __builtin_bswap32(*((uint *)(a))))
#define fastputdword(a, d) (*((uint *)(a)) = __builtin_bswap32((uint)(d)))

int main()
{
unsigned short s1, s2, s3;
unsigned int i1, i2, i3;

        s1 = 0x1234;
        putword(&s2, s1);
        s3 = word(&s2);
        i1 = 0x12345678;
        putdword(&i2, i1);
        i3 = dword(&i2);
        printf("s1=%x, s2=%x, s3=%x, i1=%x, i2=%x, i3=%x\n", s1, s2, s3, i1, i2, i3);

        s1 = 0x1234;
        fastputword(&s2, s1);
        s3 = fastword(&s2);
        i1 = 0x12345678;
        fastputdword(&i2, i1);
        i3 = fastdword(&i2);
        printf("s1=%x, s2=%x, s3=%x, i1=%x, i2=%x, i3=%x\n", s1, s2, s3, i1, i2, i3);
}

`*((char *)(a)) =` If you use normal types instead of single bytes, the compiler will be smart enough to optimize it to the same code... Why do you use `char` everywhere? The code is not equivalent: if `a` is not aligned to `ushort` or `uint` you'll get a seg fault. That is where the "improvement" you are seeing comes from: fewer instructions, because the compiler can use aligned instructions. It has less to do with `__builtin_bswap*` and more to do with using `*((ushort *) a) =` instead of `*(char*)a =` assignments. — KamilCuk, Apr 04 '21 at 22:19
I tried modifying my test program to specifically force a non-aligned dword access via __builtin_bswap32, and it works. So, I guess X86_64 machines don't care about 16- and 32-bit memory alignment when fetching a short or an int. — littlenoodles, Apr 06 '21 at 15:20
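
That the unaligned access works is a property of x86_64 hardware rather than of C: accessing memory through a misaligned `uint *` is undefined behavior in the language, and the pointer cast can also conflict with strict-aliasing rules. One commonly used alternative is to copy through `memcpy` and swap the local value. GCC and Clang typically reduce a small fixed-size `memcpy` like this to a single load or store when optimizing. A sketch, assuming GCC or Clang (for `__BYTE_ORDER__` and the builtins); the `load_be*`/`store_be*` names are hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Assumption: GCC or Clang, which predefine __BYTE_ORDER__ and provide __builtin_bswap*. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#  define BE16(x) __builtin_bswap16(x)
#  define BE32(x) __builtin_bswap32(x)
#elif defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#  define BE16(x) (x)   /* already big-endian: nothing to swap */
#  define BE32(x) (x)
#else
#  error "unknown byte order"
#endif

/* memcpy makes these safe for any alignment and any underlying object type. */
static inline uint16_t load_be16(const void *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return BE16(v);
}

static inline void store_be16(void *p, uint16_t w)
{
    w = BE16(w);
    memcpy(p, &w, sizeof w);
}

static inline uint32_t load_be32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return BE32(v);
}

static inline void store_be32(void *p, uint32_t d)
{
    d = BE32(d);
    memcpy(p, &d, sizeof d);
}
```

With these, a call such as `load_be32(buf + 1)` on a byte buffer is well defined even though `buf + 1` is not 4-byte aligned.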

score 0 · Answer 2 · answered Apr 04 '21 at 23:55

I would just use htons, htonl and friends. They're a lot more portable, and it's very likely that the authors of any given libc will have implemented them as inline functions or macros that invoke __builtin intrinsics or inline asm or whatever, resulting in what should be a nearly-optimal implementation for that specific machine. See what is generated in godbolt's setup, which I think is some flavor of Linux/glibc.

You do need to compile with optimizations for them to be inlined, otherwise it generates an ordinary function call. But even -Og gets them inlined and should not mess up your debugging as much. Anyway, if you're compiling without optimizations altogether, your entire program will be so inefficient that the extra couple instructions to call htons must surely be the least of your worries.
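
As a rough sketch of this approach, the question's four macros could be expressed with the POSIX `htons`/`htonl` family from `<arpa/inet.h>`. This assumes the pointers refer to aligned 16- and 32-bit objects, as the question's comments state. Whether the calls are inlined depends on the libc and the optimization level:

```c
#include <arpa/inet.h>   /* htons, htonl, ntohs, ntohl (POSIX) */
#include <stdint.h>

/* Network byte order is big-endian, so "network" here means "database order".
   Assumption: p points at an aligned uint16_t or uint32_t object. */
#define word(p)        ((uint16_t)ntohs(*(const uint16_t *)(p)))
#define putword(p, w)  ((void)(*(uint16_t *)(p) = htons((uint16_t)(w))))
#define dword(p)       ((uint32_t)ntohl(*(const uint32_t *)(p)))
#define putdword(p, d) ((void)(*(uint32_t *)(p) = htonl((uint32_t)(d))))
```

Compiling with something like `gcc -Og -g` and disassembling in gdb is one way to check whether the swap is emitted inline rather than as a function call.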

Hmmm. My use of the __builtin_bswap functions would be confined to macro definitions in my config.h file, only on systems where __builtin_bswap was known to be available. So if using htons would require optimization at all, then I'd prefer to go directly to the 'source'. As far as whether I'd see any speed advantage, I guess that's yet to be seen, but (see my comment on my original post) for Series/1 machine emulation, I'd think the benefit would be significant. — littlenoodles, Apr 06 '21 at 15:16
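
One way to confine the builtins to compilers known to provide them is a feature test in `config.h`. The sketch below is only an assumption about how that could look: `__has_builtin` exists in Clang and in GCC 10 and later, the version check covers older GCC (which, to my knowledge, gained `__builtin_bswap16` on all targets in 4.8), and `HAVE_BUILTIN_BSWAP`, `SWAP16` and `SWAP32` are hypothetical names:

```c
#include <stdint.h>

/* Detect the builtins: prefer __has_builtin, else fall back to a GCC version check. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_bswap16) && __has_builtin(__builtin_bswap32)
#    define HAVE_BUILTIN_BSWAP 1
#  endif
#elif defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8))
#  define HAVE_BUILTIN_BSWAP 1
#endif

#ifdef HAVE_BUILTIN_BSWAP
#  define SWAP16(x) __builtin_bswap16((uint16_t)(x))
#  define SWAP32(x) __builtin_bswap32((uint32_t)(x))
#else
/* Portable shift-and-mask fallback for compilers without the builtins. */
#  define SWAP16(x) ((uint16_t)((((uint16_t)(x) & 0x00ffu) << 8) | \
                                (((uint16_t)(x) >> 8) & 0x00ffu)))
#  define SWAP32(x) ((uint32_t)((((uint32_t)(x) & 0x000000ffu) << 24) | \
                                (((uint32_t)(x) & 0x0000ff00u) <<  8) | \
                                (((uint32_t)(x) & 0x00ff0000u) >>  8) | \
                                (((uint32_t)(x) & 0xff000000u) >> 24)))
#endif
```

The word/putword macros can then be written once in terms of `SWAP16`/`SWAP32`, with the choice of implementation made in a single place.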
