
I need to efficiently swap the byte order of an array during copying into another array.

The source array is of a known type (char, short, or int), so the byte swapping required is unambiguous and determined by that type.

My plan is to do this very simply with a multi-pass byte-wise copy (2 passes for short, 4 for int, ...). However, are there any pre-existing "memcpy_swap_16/32/64" functions or libraries? Perhaps something exists in image-processing libraries for BGR/RGB conversion.

EDIT

I know how to swap the bytes of individual values, that is not the problem. I want to do this process during a copy that I am going to perform anyway.

For example, if I have an array of little-endian 4-byte integers I can do the swap by performing 4 bytewise copies with initial offsets of 0, 1, 2 and 3 with a stride of 4. But there may be a better way; perhaps even reading each 4-byte integer individually and using the byte-swap intrinsics _byteswap_ushort, _byteswap_ulong and _byteswap_uint64 would be faster. But I suspect there must be existing functions that do this type of processing.

EDIT 2

Just found this, which may be a useful basis for SSE, though it's true that memory bandwidth probably makes it a waste of time.

Fast vectorized conversion from RGB to BGRA

3 Answers


Unix systems have a swab function that does what you want for 16-bit arrays. It's probably optimized, but I'm not sure. Note that modern gcc will generate extremely efficient code if you just write the naive byte swap code:

uint32_t x, y;
y = (x<<24) | (x<<8 & 0xff0000) | (x>>8 & 0xff00) | (x>>24);

i.e. it will use the bswap instruction on i486+. Presumably putting this in a loop will give an efficient loop too...

Edit: For your copying task, I would do the following in your loop:

  1. Read a 32-bit value from const uint32_t *src.
  2. Use the above code to swap it.
  3. Write a 32-bit value to uint32_t *dest.

Strictly speaking this may not be portable (aliasing violations) but as long as the copy function is in its own translation unit and not getting inlined, there's very little to worry about. Forget what I wrote about aliasing; if you're swapping the data as 32-bit values, it almost surely was actually 32-bit values to begin with, not some other type of pointer that was cast, so there's no issue.

R.. GitHub STOP HELPING ICE
    The compiler byte swap intrinsics are a better way to guarantee use of the correct instruction. But this is not the problem. –  Sep 08 '11 at 12:45
  • I'm not sure why you'd call them "better". They're specific to a particular compiler. The code I gave will generate the "correct" instruction on any compiler that actually bothers to optimize. – R.. GitHub STOP HELPING ICE Sep 08 '11 at 12:49
  • Because it will be fast even in unoptimized debug builds. –  Sep 08 '11 at 13:08
  • Agree with copying in chunks as it reduces memory accesses. –  Sep 08 '11 at 13:25
  • Turning off optimization for debugging should only be a last resort, not standard practice. Actually, turning off optimization tends to **hide bugs**, since a large portion of bugs in C code come from invoking undefined behavior in situations that might give the "right" behavior with a naive translation of the C to machine code, but serious bugs with a real optimizing compiler. – R.. GitHub STOP HELPING ICE Sep 08 '11 at 13:44
  • @R.. gcc optimizes *only* for bswap of the long/pointer size (32 bits on i386, 64 bits on x86-64). Inverting the byte order of other word sizes still results in dealing with each mask separately, even with gcc 4.9 and -O3. Seems it has been fed with optimization only for a single explicit case. Moreover, the same is true for clang (up to 3.3). – Netch Dec 17 '13 at 18:41
  • @Netch: For the 16-bit version, GCC uses `xchg %ah,%al`. What cases are you claiming it fails to optimize? – R.. GitHub STOP HELPING ICE Dec 17 '13 at 18:58
  • @R.. on 32-bit system it fails to optimize 64-bit byte swapping. On 64-bit system its, surprisingly, fails to optimize 32-bit byte swapping:( – Netch Dec 18 '13 at 21:01
  • @R.. The most precise description what I've got: http://segfault.kiev.ua/~netch/articles/20131219-bswap.txt – Netch Dec 19 '13 at 07:01

In Linux, you should check the header bits/byteswap.h. There's a family of macros of the form bswap_##, and some of them use assembly instructions where appropriate.

Foo Bah
    This header, as written, is an abomination. They use inline assembly to make it "fast", then gcc extensions to favor C over the assembly when the arguments are constants so gcc can collapse the constants. BUT -- and here's what makes it almost funny if it weren't so sad -- gcc will generate the same or better asm on its own if you just write the naive C like I wrote in my answer. – R.. GitHub STOP HELPING ICE Sep 08 '11 at 03:10
  • @R. it contains 16,32,64 bit implementations and handles 32/64 bit systems correctly. And makes it as simple as `bswap_16(...)`. – Foo Bah Sep 08 '11 at 03:13
  • I agree the functions are useful. I just claim the header is horribly written. If all the inline asm and gcc extensions were ripped out of it, the generated code would be just as good or better, and the possibility of bugs and incompatibilities would be nearly eliminated. Also, cleaning it up would aid in teaching newbies not to practice premature optimization... – R.. GitHub STOP HELPING ICE Sep 08 '11 at 03:18

Yes, there are existing functions like the one linked in the question, but it's not worth the effort because the size of the data (in this case) means the set-up overhead is too high. So instead, it's better to just read 2, 4, or 8 bytes at a time, do the swap using intrinsics, and write them back.