
Assume you have a 32-bit unsigned integer whose bytes are organized like this: a b c d. What is the fastest way to spread these bytes into a 64-bit unsigned integer in this fashion: 0 a 0 b 0 c 0 d? This is for the x86-64 architecture. I would like to know the fastest approach without using special intrinsics, although that would also be interesting. (I say 'fastest', but compact solutions with reasonable performance are also nice.)

Edit for people who want context. This seems like a really easy task, just shifting some bytes around, yet it requires more instructions than you'd think (check this godbolt with optimizations enabled). So I'm wondering if anyone knows of a way to solve the problem with fewer instructions.
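
For reference, a straightforward shift-and-mask version of the task (just a sketch; the function name is illustrative) could look like this:

#include <stdint.h>

// Naive version: isolate each byte and shift it into its 16-bit slot.
uint64_t spread_bytes(uint32_t v) {
    return  (uint64_t)(v & 0x000000ff)
          | ((uint64_t)(v & 0x0000ff00) << 8)
          | ((uint64_t)(v & 0x00ff0000) << 16)
          | ((uint64_t)(v & 0xff000000) << 24);
}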

HueHue
  • Probably just bit shifting then XOR-ing them together – Cory Kramer Sep 18 '20 at 18:23
    What have you tried - please edit your best code into your question and explain why you think it isn't what you need – DisappointedByUnaccountableMod Sep 18 '20 at 18:27
    from the `performance` tag: "For questions pertaining to the measurement or improvement of code and application efficiency." You have nothing to measure or improve yet. Unless you have something, anything **is** the fastest. Smells like premature optimization. Please show your code – 463035818_is_not_an_ai Sep 18 '20 at 18:32
  • This works for 16-bit to 32-bit spreading: `((x * 0x0101010101010101L & 0x8040201008040201L) * 0x0102040810204081L >> 49) & 0x5555`. Taken from [this thread](https://stackoverflow.com/questions/36369006/how-to-spread-bits-in-a-byte). – brc-dd Sep 18 '20 at 18:47
  • The updated function is fine. The alternative is a loop which would add a handful of additional instructions. – David C. Rankin Sep 18 '20 at 18:58
    Pick whichever you like most https://godbolt.org/z/3E7Gsa but take into consideration that on x86_64, fewer instructions don't necessarily mean faster execution time. – Alex Lop. Sep 18 '20 at 19:14
  • Fastest would be a very large lookup table. – stark Sep 18 '20 at 19:25
  • @AlexLop. I like the additional solution of using a union. I know fewer instructions don't necessarily mean faster execution time, but judging based on execution time is too machine-dependent. Generated instructions are easier to compare. – HueHue Sep 18 '20 at 19:52
  • @stark Maybe so, but I would not consider a naive 32 GB lookup table a reasonable or practical way of solving this problem. Maybe a nice mixed implementation, with a small lookup table and a few instructions, is possible though (see the sketch after this comment list). – HueHue Sep 18 '20 at 20:02
  • *I like the additional solution of using a union.* The downside of using a union for type punning is **undefined behavior** in C++. – Eljay Sep 18 '20 at 20:18
  • Are you only going to do this once? Or on a stream of `int`s? Sounds like a job for SIMD – Vlad Feinstein Sep 18 '20 at 20:31
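
Regarding the lookup-table idea in the comments, a possible middle ground (a sketch only; names and sizes are illustrative) is a 64K-entry table that spreads two bytes at a time, costing 256 KiB of data and two lookups per 32-bit value:

#include <stdint.h>

static uint32_t spread16[65536];   // spread16[0xaabb] == 0x00aa00bb

void init_spread16(void) {
    for (uint32_t i = 0; i < 65536; i++)
        spread16[i] = ((i & 0xff00u) << 8) | (i & 0x00ffu);
}

uint64_t spread_bytes_lut(uint32_t v) {
    // Spread the high and low 16-bit halves separately, then combine.
    return ((uint64_t)spread16[v >> 16] << 32) | spread16[v & 0xffff];
}

Whether this beats the pure shift-and-mask versions depends on cache behaviour, so it would need measuring.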

2 Answers

uint64_t x = ...;
// 0 0 0 0 a b c d
x |= x << 16;
// 0 0 a b ? ? c d
x = x << 8 & 0x00ff000000ff0000 | x & 0x000000ff000000ff;
// 0 a 0 b 0 c 0 d

And for completeness, modern x86 processors can do this with one quick instruction:

x = _pdep_u64(x, 0xff00ff00ff00ff)
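
For reference, a self-contained version of the pdep approach (a sketch; assumes BMI2 support, e.g. compile with -mbmi2):

#include <stdint.h>
#include <immintrin.h>   // _pdep_u64 (BMI2)

uint64_t spread_bytes_pdep(uint32_t v) {
    // Deposit the 32 source bits into the low byte of each 16-bit slot.
    return _pdep_u64(v, 0x00ff00ff00ff00ffULL);
}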
zch
  • I like it! Shorter and saves 3 operations when compared to the godbolt I provided in the question. – HueHue Sep 18 '20 at 19:10
  • wouldn't _pdep_u64(x, 0xff00ff00ff00ff) simply pass on the bits at the indicated locations? It has a 32-bit input and 32-bit output. The only instruction(s) I've found that can do the requested byte-to-word transformation are the various forms of punpck. And on x64 the only one I've found takes either the high or low __m128i of an __m256i and spreads it over the whole __m256i. 32-bit code may have a 32-to-64-bit version, but I've found no way to make that work in x64 code. – SoronelHaetir Sep 18 '20 at 23:43
  • @SoronelHaetir, the task is to pass bits to indicated locations. It is equivalent to other solutions https://godbolt.org/z/7vq49n – zch Sep 19 '20 at 09:37

Something like this?

_mm256_cvtepu8_epi16(eight_bit_numbers): takes a 128-bit vector of sixteen unsigned 8-bit numbers and zero-extends it to a 256-bit vector of sixteen 16-bit integers. For example:

 #include <immintrin.h>  // AVX2

 __m128i value1 = _mm_setr_epi8(0x11, 0x22, 0x33, 0x44,
    0x55, 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff, 0x00);
 __m256i value2 = _mm256_cvtepu8_epi16(value1);

Or for 32-bit -> 64-bit:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_cvtepu32_epi64
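
For the single 32-bit value in the question, a sketch using the narrower SSE4.1 byte-widening intrinsic _mm_cvtepu8_epi16 (pmovzxbw); the function name is illustrative:

#include <stdint.h>
#include <smmintrin.h>   // SSE4.1: _mm_cvtepu8_epi16

uint64_t spread_bytes_simd(uint32_t v) {
    __m128i x = _mm_cvtsi32_si128((int)v);   // bytes d, c, b, a in lanes 0..3
    __m128i w = _mm_cvtepu8_epi16(x);        // zero-extend each byte to 16 bits
    return (uint64_t)_mm_cvtsi128_si64(w);   // low 64 bits: 0 a 0 b 0 c 0 d
}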

Vlad Feinstein