Interleave 4 byte ints to 8 byte int

Question

I'm currently working to create a function which accepts two 4 byte unsigned integers, and returns an 8 byte unsigned long. I've tried to base my work off of the methods depicted by this research but all my attempts have been unsuccessful. The specific inputs I am working with are: 0x12345678 and 0xdeadbeef, and the result I'm looking for is 0x12de34ad56be78ef. This is my work so far:

unsigned long interleave(uint32_t x, uint32_t y){
    uint64_t result = 0;
    int shift = 33;

    for(int i = 64; i > 0; i-=16){
        shift -= 8;
        //printf("%d\n", i);
        //printf("%d\n", shift);
        result |= (x & i) << shift;
        result |= (y & i) << (shift-1);
    }
}

However, this function keeps returning 0xfffffffe which is incorrect. I am printing and verifying these values using:

printf("0x%x\n", z);

and the input is initialized like so:

uint32_t x = 0x12345678;
uint32_t y = 0xdeadbeef;
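
One detail worth checking in the test harness above: `%x` expects an `unsigned int`, so handing it a 64-bit result is a mismatch and, on common platforms, tends to show only the low 32 bits. A minimal sketch of printing the full value, assuming `<inttypes.h>` is available; the helper names `print_u64_hex` and `low_32_bits` are made up for illustration:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: print a 64-bit value in hex using the matching
   format macro. Returns the number of characters written, like printf. */
int print_u64_hex(uint64_t value)
{
    /* PRIx64 expands to the correct length modifier for uint64_t. */
    return printf("0x%" PRIx64 "\n", value);
}

/* Hypothetical helper: the part of a 64-bit value that survives
   truncation to 32 bits. */
uint32_t low_32_bits(uint64_t value)
{
    return (uint32_t)value;
}
```

For the expected result `0x12de34ad56be78ef`, `low_32_bits` gives `0x56be78ef`, which is what a truncated print would display even when the function itself is correct.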

Any help on this topic would be greatly appreciated, C has been a very difficult language for me, and bitwise operations even more so.

Related, it may be educational to see what `std::cout << sizeof(unsigned long)` on your platform is. Why you're not using `uint64_t` when you're already using `uint32_t` is a little odd. — WhozCraig, Sep 06 '18 at 22:15
your `i` values are not the correct masks. And your shift amount needs to be decremented in steps of 8 bits. — Barmar, Sep 06 '18 at 22:21
Use inline assembly with [`PSHUFB`](http://www.felixcloutier.com/x86/PSHUFB.html) or its intrinsics equivalent: `(V)PSHUFB: __m128i _mm_shuffle_epi8 (__m128i a, __m128i b)`. — zx485, Sep 06 '18 at 22:37
`result` doesn't have an initial value. If you are `or`ing stuff into it, you need to make sure it's empty first. (`result = 0;`) — tdk001, Sep 06 '18 at 22:45
Even after your edit, @Barmar's point remains true. You now use `i` values of 64, 48, 32 and 16; those only mask off single bits, not whole bytes, so now you're setting at most eight bits, not eight bytes. And you're shifting by 33 and 32, 25 and 24, etc. so you're shifting into adjacent bits, not bytes. — ShadowRanger, Sep 06 '18 at 22:56
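
The points raised in these comments (whole-byte masks, shifts in multiples of 8, and a zeroed accumulator) can be sketched as a loop that walks from the most significant byte down. This is only one possible formulation, not the asker's code; the function name `interleave_msb_first` is an assumption for illustration:

```c
#include <stdint.h>

/* Sketch: build the result from the top byte pair down to the bottom. */
uint64_t interleave_msb_first(uint32_t x, uint32_t y)
{
    uint64_t result = 0;  /* must start empty before bits are OR-ed in */

    for (int byte = 3; byte >= 0; byte--) {
        /* Extract one whole byte (8 bits) from each input. */
        uint64_t xb = (x >> (8 * byte)) & 0xFFu;
        uint64_t yb = (y >> (8 * byte)) & 0xFFu;

        /* Make room for two more bytes, then append x's byte above y's. */
        result = (result << 16) | (xb << 8) | yb;
    }
    return result;
}
```

Because each iteration shifts the accumulator by 16 bits, the shift amounts never need to be tracked separately.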

harold · Answer 1 · 2018-09-07T15:33:46.720

This can be done based on interleaving bits, but skipping some steps so it only interleaves bytes. Same idea: first spread out the bytes in a couple of steps, then combine them.

Here is the plan, illustrated with my amazing freehand drawing skills:

In C (not tested):

// step 1, moving the top two bytes up by 16 bits
uint64_t a = (((uint64_t)x & 0xFFFF0000) << 16) | (x & 0xFFFF);
// step 2, moving bytes 1 and 5 (counting from 0 at the low end) up by 8 bits
a = ((a & 0x0000FF000000FF00) << 8) | (a & 0x000000FF000000FF);
// same thing with y
uint64_t b = (((uint64_t)y & 0xFFFF0000) << 16) | (y & 0xFFFF);
b = ((b & 0x0000FF000000FF00) << 8) | (b & 0x000000FF000000FF);
// merge them
uint64_t result = (a << 8) | b;
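
The spread-then-merge plan can be wrapped in a helper so that each step is easy to check. The following is a sketch under that reading of the plan; `spread_bytes` and `interleave_spread` are illustrative names, and the intermediate values in the comments were traced by hand for the question's input `0x12345678`:

```c
#include <stdint.h>

/* Hypothetical helper: move byte k of v to byte position 2k,
   leaving a zero byte above each one. */
uint64_t spread_bytes(uint32_t v)
{
    /* Step 1: upper 16 bits move up by 16.
       0x12345678 -> 0x0000123400005678 */
    uint64_t a = (((uint64_t)v & 0xFFFF0000ULL) << 16) | (v & 0xFFFFULL);

    /* Step 2: bytes 1 and 5 move up by 8; bytes 0 and 4 stay put.
       0x0000123400005678 -> 0x0012003400560078 */
    a = ((a & 0x0000FF000000FF00ULL) << 8) | (a & 0x000000FF000000FFULL);
    return a;
}

/* x fills the odd byte positions, y fills the even ones. */
uint64_t interleave_spread(uint32_t x, uint32_t y)
{
    return (spread_bytes(x) << 8) | spread_bytes(y);
}
```

Checking `spread_bytes` on its own first makes a wrong mask show up immediately, since the gaps between bytes must come out as zero.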

Using SSSE3 PSHUFB has been suggested. It would work, but there is an instruction that can do a byte-wise interleave in one go: punpcklbw. So all we really need to do is get the values into and out of vector registers, and that single instruction will take care of the rest.

Not tested:

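#include <stdint.h>
#include <emmintrin.h>  // SSE2 intrinsics

uint64_t interleave(uint32_t x, uint32_t y) {
  __m128i xvec = _mm_cvtsi32_si128(x);
  __m128i yvec = _mm_cvtsi32_si128(y);
  // punpcklbw: low bytes alternate y0 x0 y1 x1 ..., so y supplies the even bytes
  __m128i interleaved = _mm_unpacklo_epi8(yvec, xvec);
  // the 64-bit extract is only available on x86-64 targets
  return _mm_cvtsi128_si64(interleaved);
}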
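
Since the intrinsic version only builds for x86-64 targets with SSE2, one option is to guard it and fall back to plain shifts elsewhere. A sketch, assuming a GCC- or Clang-style compiler that defines `__SSE2__` and `__x86_64__` (other compilers may use different macros); the name `interleave_simd_or_scalar` is made up for illustration:

```c
#include <stdint.h>

#if defined(__SSE2__) && defined(__x86_64__)
#include <emmintrin.h>
#endif

uint64_t interleave_simd_or_scalar(uint32_t x, uint32_t y)
{
#if defined(__SSE2__) && defined(__x86_64__)
    /* Vector path: punpcklbw interleaves the low bytes of both registers. */
    __m128i xvec = _mm_cvtsi32_si128((int)x);
    __m128i yvec = _mm_cvtsi32_si128((int)y);
    return (uint64_t)_mm_cvtsi128_si64(_mm_unpacklo_epi8(yvec, xvec));
#else
    /* Scalar fallback: byte i of y goes to byte 2i, byte i of x to 2i+1. */
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        result |= (uint64_t)((x >> (8 * i)) & 0xFFu) << (16 * i + 8);
        result |= (uint64_t)((y >> (8 * i)) & 0xFFu) << (16 * i);
    }
    return result;
#endif
}
```

Both paths are meant to return the same value, so the guard only affects which instructions are generated.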

score 1 · Answer 2 · answered Sep 06 '18 at 23:02

You could do it like this:

uint64_t interleave(uint32_t x, uint32_t y)
{
     uint64_t z;

     unsigned char *a = (unsigned char *)&x;   // 1
     unsigned char *b = (unsigned char *)&y;   // 1
     unsigned char *c = (unsigned char *)&z;

     c[0] = a[0];
     c[1] = b[0];
     c[2] = a[1];
     c[3] = b[1];
     c[4] = a[2];
     c[5] = b[2];
     c[6] = a[3];
     c[7] = b[3];

     return z;
}

Interchange a and b on the lines marked 1 depending on ordering requirement.

A version with shifts, where the LSB of y is always the LSB of the output as in your example, is:

uint64_t interleave(uint32_t x, uint32_t y)
{
     return 
           (y & 0xFFull)
         | (x & 0xFFull)       << 8
         | (y & 0xFF00ull)     << 8
         | (x & 0xFF00ull)     << 16
         | (y & 0xFF0000ull)   << 16
         | (x & 0xFF0000ull)   << 24
         | (y & 0xFF000000ull) << 24
         | (x & 0xFF000000ull) << 32;
}

The compilers I tried don't seem to do a good job of optimizing either version so if this is a performance critical situation then maybe the inline assembly suggestion from comments is the way to go.

answered Sep 06 '18 at 23:02

M.M


The first version depends on whether the machine is big-endian or little-endian – Barmar Sep 06 '18 at 23:54
The second version returns `0x56be78ef` which is the last half of my desired output, I tried to extend it to no avail. – Philip DiSarro Sep 07 '18 at 00:00
@PhilipDiSarro the code I posted works correctly; perhaps your attempt made a mistake somewhere – M.M Sep 07 '18 at 00:20
@Barmar I address this in the first line after the snippet – M.M Sep 07 '18 at 00:20
Are you referring to "depending on ordering requirement"? How are you supposed to know what the order is? – Barmar Sep 07 '18 at 00:21
@Barmar The situation the code is being used in will determine which order is desired. This code can be used for either ordering by making the adjustment I suggested – M.M Sep 07 '18 at 00:24
The hardware architecture and compiler implementation determines whether it's big-endian or little-endian, not the situation where the code is being used. – Barmar Sep 07 '18 at 00:25
@Barmar I am not sure what point you are trying to make, sorry. Normally in programming there is a requirement which the code implements. E.g. if someone wants to output `2` you might write `printf("2");` or various other options. This isn't determined by the hardware or whatever; there are requirements dictated by the task at hand. For example, maybe they want the first byte in memory of `y` to be the first byte in memory of the result, in which case my first code sample works on all architectures and my second doesn't. Which is why I provided the two samples – M.M Sep 07 '18 at 00:28
Actually, we seem to be talking about different things. On a machine with different endianness, you need to change `a[0]` to `a[3]`, `a[1]` to `a[2]`, etc. – Barmar Sep 07 '18 at 00:34
@Barmar The endianness of the `uint64_t` is the same as the endianness of the `uint32_t`, typically – M.M Sep 07 '18 at 00:37
@M.M I suppose that's likely. All the answers that use type punning are undefined or implementation-defined behavior. I'd stick with just your second version using masking and shifting. – Barmar Sep 07 '18 at 00:39
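
To make the byte-order discussion concrete, here is a sketch of a byte-copy variant that yields the same numeric result on little-endian and big-endian hosts by probing the byte order at run time. It assumes the host is one of those two (no mixed-endian layouts) and that the 32-bit and 64-bit types share the same order; the names `host_is_little_endian` and `interleave_via_bytes` are illustrative only:

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the least significant byte is stored first in memory. */
int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

uint64_t interleave_via_bytes(uint32_t x, uint32_t y)
{
    unsigned char xb[4], yb[4], zb[8];
    uint64_t z;
    const int little = host_is_little_endian();

    memcpy(xb, &x, sizeof xb);
    memcpy(yb, &y, sizeof yb);

    /* i counts value bytes from least significant (0) upward. */
    for (int i = 0; i < 4; i++) {
        int src = little ? i : 3 - i;             /* where value byte i lives */
        int even = little ? 2 * i : 7 - 2 * i;     /* output value byte 2i     */
        int odd = little ? 2 * i + 1 : 6 - 2 * i;  /* output value byte 2i+1   */
        zb[even] = yb[src];
        zb[odd] = xb[src];
    }

    memcpy(&z, zb, sizeof z);
    return z;
}
```

The masking-and-shifting version avoids all of this bookkeeping, which is one reason it was recommended in the comments above.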

bigwillydos · Accepted Answer · 2018-09-07T16:38:22.837

With bit-shifting and bitwise operations (endianness independent):

uint64_t interleave(uint32_t x, uint32_t y){

    uint64_t result = 0;

    for(uint8_t i = 0; i < 4; i ++){
        result |= ((x & (0xFFull << (8*i))) << (8*(i+1)));
        result |= ((y & (0xFFull << (8*i))) << (8*i));
    }

    return result;
}

With pointers (endianness dependent):

uint64_t interleave(uint32_t x, uint32_t y){

    uint64_t result = 0;

    uint8_t * x_ptr = (uint8_t *)&x;
    uint8_t * y_ptr = (uint8_t *)&y;
    uint8_t * r_ptr = (uint8_t *)&result;

    for(uint8_t i = 0; i < 4; i++){
        *(r_ptr++) = y_ptr[i];
        *(r_ptr++) = x_ptr[i];
    }

    return result;

}

Note: this solution assumes little-endian byte order

edited Sep 07 '18 at 16:38

answered Sep 06 '18 at 23:22

bigwillydos


This returns `0x56ffffef` – Philip DiSarro Sep 06 '18 at 23:48
Hm, weird. Works fine on my machine. Tested it [here](http://tpcg.io/777UOZ) as well. – bigwillydos Sep 07 '18 at 00:00
@PhilipDiSarro added a way to do it with pointers, similar to one of the other answers but uses a for-loop. Tested it [here](http://tpcg.io/T02Tix). – bigwillydos Sep 07 '18 at 00:33
The latter code, aliasing through `uint8_t`, depends on byte order. – Eric Postpischil Sep 07 '18 at 00:50
Your first answer needs to zero-initialize `result`. – Zrax Sep 07 '18 at 16:21
@EricPostpischil 100% correct, i'll add it to my answer – bigwillydos Sep 07 '18 at 16:34
@Zrax Thanks, i'll add that too – bigwillydos Sep 07 '18 at 16:35

score 1 · Answer 4 · answered Sep 06 '18 at 23:25

Use union punning. Easy for the compiler to optimize.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

typedef union
{
        uint64_t u64;
        struct 
        {
            union
            {
                uint32_t a32;
                uint8_t a8[4];
            };
            union
            {
                uint32_t b32;
                uint8_t b8[4];
            };
        };
        uint8_t u8[8];
}data_64;

uint64_t interleave(uint32_t a, uint32_t b)
{
    data_64 in , out;

    in.a32 = a;
    in.b32 = b;



    for(size_t index = 0; index < sizeof(a); index ++)
    {

        out.u8[index * 2 + 1] = in.a8[index];
        out.u8[index * 2 ] = in.b8[index];
    }
    return out.u64;
}


int main(void)
{

    printf("%llx\n", (unsigned long long)interleave(0x12345678U, 0xdeadbeefU));
}

answered Sep 06 '18 at 23:25

0___________


This code is not portable as it depends on byte order. – Eric Postpischil Sep 07 '18 at 00:51
Same as all the others posted here, but it is extremely easy to amend to be universal – 0___________ Sep 07 '18 at 00:53
No, not the same as all the others. [M.M’s answer](https://stackoverflow.com/a/52213142/298225) includes a solution not dependent on byte order, and the one that is dependent on byte order mentions that. [bigwillydos' answer](https://stackoverflow.com/a/52213263/298225) includes a solution not dependent on byte order, although it fails to document that its other solution is dependent. Engineering is not throwing some code out to be used at one’s own risk; engineering is documenting what the code characteristics are (among other things). – Eric Postpischil Sep 07 '18 at 01:05
What is "easy for compiler to optimize" based on? If I [try this](https://gcc.godbolt.org/z/BoUIyd) it doesn't come out all that well.. Clang literally keeps the memory writes and read at the end, GCC keeps it in registers but I'm not really impressed by what it did. – harold Sep 07 '18 at 15:42
