I'm trying to write a function that accepts two 4-byte unsigned integers and returns an 8-byte unsigned long. I've tried to base my work on the methods described in this research, but all my attempts have been unsuccessful. The specific inputs I am working with are 0x12345678 and 0xdeadbeef, and the result I'm looking for is 0x12de34ad56be78ef. This is my work so far:

unsigned long interleave(uint32_t x, uint32_t y){
    uint64_t result = 0;
    int shift = 33;

    for(int i = 64; i > 0; i-=16){
        shift -= 8;
        //printf("%d\n", i);
        //printf("%d\n", shift);
        result |= (x & i) << shift;
        result |= (y & i) << (shift-1);
    }
}

However, this function keeps returning 0xfffffffe which is incorrect. I am printing and verifying these values using:

printf("0x%x\n", z);

and the input is initialized like so:

uint32_t x = 0x12345678;
uint32_t y = 0xdeadbeef;

Any help on this topic would be greatly appreciated. C has been a very difficult language for me, and bitwise operations even more so.

Philip DiSarro
  • Related, it may be educational to see what `std::cout << sizeof(unsigned long)` on your platform is. Why you're not using `uint64_t` when you're already using `uint32_t` is a little odd. – WhozCraig Sep 06 '18 at 22:15
  • You need to use `%lx` to print an unsigned long. – Barmar Sep 06 '18 at 22:15
  • Your `i` values are not the correct masks. And your shift amount needs to be decremented in steps of 8 bits. – Barmar Sep 06 '18 at 22:21
  • Use inline assembly with [`PSHUFB`](http://www.felixcloutier.com/x86/PSHUFB.html) or its intrinsics equivalent: `(V)PSHUFB: __m128i _mm_shuffle_epi8 (__m128i a, __m128i b)`. – zx485 Sep 06 '18 at 22:37
  • `result` doesn't have an initial value. If you are `or`ing stuff into it, you need to make sure it's empty first. (`result = 0;`) – tdk001 Sep 06 '18 at 22:45
  • Even after your edit, @Barmar's point remains true. You now use `i` values of 64, 48, 32 and 16; those only mask off single bits, not whole bytes, so now you're setting at most eight bits, not eight bytes. And you're shifting by 33 and 32, 25 and 24, etc. so you're shifting into adjacent bits, not bytes. – ShadowRanger Sep 06 '18 at 22:56
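
Taken together, the comments above point at three fixes: mask whole bytes rather than using the loop counter as a mask, move each byte by a multiple of 8 bits, and print the result with a 64-bit format. A minimal sketch of the question's loop with those changes follows. The names `interleave_loop` and `print_u64` are hypothetical and used only for illustration, and `PRIx64` from `<inttypes.h>` is used instead of `%lx` so the output does not depend on how wide `unsigned long` is on a given platform.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical corrected form of the question's loop: byte i of x goes to
 * byte 2*i+1 of the result, byte i of y goes to byte 2*i. */
uint64_t interleave_loop(uint32_t x, uint32_t y)
{
    uint64_t result = 0;

    for (int i = 3; i >= 0; i--) {
        uint64_t xb = (x >> (8 * i)) & 0xFF;  /* isolate one whole byte of x */
        uint64_t yb = (y >> (8 * i)) & 0xFF;  /* isolate one whole byte of y */
        result |= xb << (16 * i + 8);         /* odd byte positions  */
        result |= yb << (16 * i);             /* even byte positions */
    }
    return result;
}

/* Print all 64 bits; "%x" would only show the low 32 bits. */
void print_u64(uint64_t v)
{
    printf("0x%016" PRIx64 "\n", v);
}
```

With the inputs from the question, `print_u64(interleave_loop(0x12345678, 0xdeadbeef))` should print `0x12de34ad56be78ef`.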

4 Answers

This can be done based on interleaving bits, but skipping some steps so it only interleaves bytes. Same idea: first spread out the bytes in a couple of steps, then combine them.

Here is the plan, illustrated with my amazing freehand drawing skills:

[image: permutation]

In C (not tested):

// step 1: move the top two bytes into the upper half
uint64_t a = (((uint64_t)x & 0xFFFF0000) << 16) | (x & 0xFFFF);
// step 2: move bytes 1 and 5 up to positions 2 and 6
a = ((a & 0x0000FF000000FF00) << 8) | (a & 0x000000FF000000FF);
// same thing with y
uint64_t b = (((uint64_t)y & 0xFFFF0000) << 16) | (y & 0xFFFF);
b = ((b & 0x0000FF000000FF00) << 8) | (b & 0x000000FF000000FF);
// merge them: x occupies the odd bytes, y the even bytes
uint64_t result = (a << 8) | b;
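
The spread-then-merge plan can be wrapped into a function and checked against the inputs from the question. This is only a sketch: `spread_bytes` and `interleave_spread` are hypothetical names, and the step-2 mask is written so that it selects bytes 1 and 5 (the ones that still need to move up by one byte after step 1).

```c
#include <stdint.h>

/* Hypothetical helper: 0xAABBCCDD -> 0x00AA00BB00CC00DD */
static uint64_t spread_bytes(uint32_t v)
{
    /* step 1: 0x00000000AABBCCDD -> 0x0000AABB0000CCDD */
    uint64_t a = (((uint64_t)v & 0xFFFF0000u) << 16) | (v & 0xFFFFu);
    /* step 2: 0x0000AABB0000CCDD -> 0x00AA00BB00CC00DD */
    a = ((a & 0x0000FF000000FF00ull) << 8) | (a & 0x000000FF000000FFull);
    return a;
}

/* x lands in the odd bytes, y in the even bytes of the result. */
uint64_t interleave_spread(uint32_t x, uint32_t y)
{
    return (spread_bytes(x) << 8) | spread_bytes(y);
}
```

For `x = 0x12345678`, the helper should yield `0x0012003400560078`, and merging with the spread form of `0xdeadbeef` gives `0x12de34ad56be78ef`.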

Using SSSE3 PSHUFB has been suggested. It will work, but there is an instruction that does a byte-wise interleave in one go: punpcklbw (SSE2). So all we really need to do is get the values into and out of vector registers, and that single instruction will take care of the rest.

Not tested:

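#include <stdint.h>
#include <emmintrin.h>  // SSE2 intrinsics

uint64_t interleave(uint32_t x, uint32_t y) {
  __m128i xvec = _mm_cvtsi32_si128(x);
  __m128i yvec = _mm_cvtsi32_si128(y);
  // low bytes alternate y0, x0, y1, x1, ... so y ends up in the even bytes
  __m128i interleaved = _mm_unpacklo_epi8(yvec, xvec);
  return _mm_cvtsi128_si64(interleaved);  // available on x86-64 only
}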
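
Since the intrinsic version only builds on x86-64 with SSE2, one way to keep a single entry point is to guard it with preprocessor checks and fall back to plain shifts elsewhere. The sketch below assumes the `__SSE2__` and `__x86_64__` macros as defined by GCC and Clang (other compilers simply take the fallback path); `interleave_simd` is a hypothetical name.

```c
#include <stdint.h>
#if defined(__SSE2__) && defined(__x86_64__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

uint64_t interleave_simd(uint32_t x, uint32_t y)
{
#if defined(__SSE2__) && defined(__x86_64__)
    /* punpcklbw path: low bytes alternate y0, x0, y1, x1, ... */
    __m128i xvec = _mm_cvtsi32_si128((int)x);
    __m128i yvec = _mm_cvtsi32_si128((int)y);
    return (uint64_t)_mm_cvtsi128_si64(_mm_unpacklo_epi8(yvec, xvec));
#else
    /* Portable fallback: place each byte with shifts. */
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        r |= (uint64_t)((y >> (8 * i)) & 0xFF) << (16 * i);
        r |= (uint64_t)((x >> (8 * i)) & 0xFF) << (16 * i + 8);
    }
    return r;
#endif
}
```

Both paths are intended to return the same value, so callers do not need to know which one was compiled in.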
harold

You could do it like this:

uint64_t interleave(uint32_t x, uint32_t y)
{
     uint64_t z;

     unsigned char *a = (unsigned char *)&x;   // 1
     unsigned char *b = (unsigned char *)&y;   // 1
     unsigned char *c = (unsigned char *)&z;

     c[0] = a[0];
     c[1] = b[0];
     c[2] = a[1];
     c[3] = b[1];
     c[4] = a[2];
     c[5] = b[2];
     c[6] = a[3];
     c[7] = b[3];

     return z;
}

Interchange a and b on the lines marked 1 depending on ordering requirement.

A version with shifts, where the LSB of y is always the LSB of the output as in your example, is:

uint64_t interleave(uint32_t x, uint32_t y)
{
     return 
           (y & 0xFFull)
         | (x & 0xFFull)       << 8
         | (y & 0xFF00ull)     << 8
         | (x & 0xFF00ull)     << 16
         | (y & 0xFF0000ull)   << 16
         | (x & 0xFF0000ull)   << 24
         | (y & 0xFF000000ull) << 24
         | (x & 0xFF000000ull) << 32;
}

The compilers I tried don't seem to do a good job of optimizing either version so if this is a performance critical situation then maybe the inline assembly suggestion from comments is the way to go.

M.M
  • The first version depends on whether the machine is big-endian or little-endian – Barmar Sep 06 '18 at 23:54
  • The second version returns `0x56be78ef` which is the last half of my desired output, I tried to extend it to no avail. – Philip DiSarro Sep 07 '18 at 00:00
  • @PhilipDiSarro the code I posted works correctly; perhaps your attempt made a mistake somewhere – M.M Sep 07 '18 at 00:20
  • @Barmar I address this in the first line after the snippet – M.M Sep 07 '18 at 00:20
  • Are you referring to "depending on ordering requirement"? How are you supposed to know what the order is? – Barmar Sep 07 '18 at 00:21
  • @Barmar The situation the code is being used in will determine which order is desired. This code can be used for either ordering by making the adjustment I suggested – M.M Sep 07 '18 at 00:24
  • The hardware architecture and compiler implementation determines whether it's big-endian or little-endian, not the situation where the code is being used. – Barmar Sep 07 '18 at 00:25
  • @Barmar I am not sure what point you are trying to make, sorry. Normally in programming there is a requirement which the code implements. E.g. if someone wants to output `2` you might write `printf("2");` or various other options. This isn't determined by the hardware or whatever; there are requirements dictated by the task at hand. For example, maybe they want the first byte in memory of `y` to be the first byte in memory of the result, in which case my first code sample works on all architectures and my second doesn't. Which is why I provided the two samples – M.M Sep 07 '18 at 00:28
  • Actually, we seem to be talking about different things. On a machine with different endianness, you need to change `a[0]` to `a[3]`, `a[1]` to `a[2]`, etc. – Barmar Sep 07 '18 at 00:34
  • @Barmar The endianness of the `uint64_t` is the same as the endianness of the `uint32_t`, typically – M.M Sep 07 '18 at 00:37
  • @M.M I suppose that's likely. All the answers that use type punning are undefined or implementation-defined behavior. I'd stick with just your second version using masking and shifting. – Barmar Sep 07 '18 at 00:39
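
To make the byte-order discussion above concrete, here is a sketch of a byte-copy version that produces the same numeric result on either byte order by checking the order at run time. It assumes the machine is strictly little-endian or big-endian (no mixed layouts) and uses `memcpy` for the byte access; `is_little_endian` and `interleave_bytes` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 when the least significant byte is stored first. */
static int is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

uint64_t interleave_bytes(uint32_t x, uint32_t y)
{
    unsigned char xb[4], yb[4], zb[8];
    uint64_t z;
    const int little = is_little_endian();

    memcpy(xb, &x, sizeof xb);
    memcpy(yb, &y, sizeof yb);

    for (int i = 0; i < 4; i++) {
        if (little) {
            zb[2 * i]     = yb[i];  /* memory runs from the low byte up */
            zb[2 * i + 1] = xb[i];
        } else {
            zb[2 * i]     = xb[i];  /* memory runs from the high byte down */
            zb[2 * i + 1] = yb[i];
        }
    }

    memcpy(&z, zb, sizeof z);
    return z;
}
```

The mask-and-shift versions avoid this branch entirely, which is one reason they are usually the simpler choice when the requirement is stated in terms of numeric value rather than memory layout.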

With bit-shifting and bitwise operations (endianness independent):

uint64_t interleave(uint32_t x, uint32_t y){

    uint64_t result = 0;

    for(uint8_t i = 0; i < 4; i ++){
        result |= ((x & (0xFFull << (8*i))) << (8*(i+1)));
        result |= ((y & (0xFFull << (8*i))) << (8*i));
    }

    return result;
}

With pointers (endianness dependent):

uint64_t interleave(uint32_t x, uint32_t y){

    uint64_t result = 0;

    uint8_t * x_ptr = (uint8_t *)&x;
    uint8_t * y_ptr = (uint8_t *)&y;
    uint8_t * r_ptr = (uint8_t *)&result;

    for(uint8_t i = 0; i < 4; i++){
        *(r_ptr++) = y_ptr[i];
        *(r_ptr++) = x_ptr[i];
    }

    return result;

}

Note: this solution assumes little-endian byte order

bigwillydos

Use union punning. Easy for the compiler to optimize.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

typedef union
{
        uint64_t u64;
        struct 
        {
            union
            {
                uint32_t a32;
                uint8_t a8[4];
            };
            union
            {
                uint32_t b32;
                uint8_t b8[4];
            };
        };
        uint8_t u8[8];
}data_64;

uint64_t interleave(uint32_t a, uint32_t b)
{
    data_64 in , out;

    in.a32 = a;
    in.b32 = b;



    for(size_t index = 0; index < sizeof(a); index ++)
    {

        out.u8[index * 2 + 1] = in.a8[index];
        out.u8[index * 2 ] = in.b8[index];
    }
    return out.u64;
}


int main(void)
{

    /* cast so the argument matches %llx regardless of how uint64_t is defined */
    printf("%llx\n", (unsigned long long)interleave(0x12345678U, 0xdeadbeefU));
}
0___________
  • This code is not portable as it depends on byte order. – Eric Postpischil Sep 07 '18 at 00:51
  • Same as all the others posted here, but it is extremely easy to amend to be universal – 0___________ Sep 07 '18 at 00:53
  • No, not the same as all the others. [M.M’s answer](https://stackoverflow.com/a/52213142/298225) includes a solution not dependent on byte order, and the one that is dependent on byte order mentions that. [bigwillydos' answer](https://stackoverflow.com/a/52213263/298225) includes a solution not dependent on byte order, although it fails to document that its other solution is dependent. Engineering is not throwing some code out to be used at one’s own risk; engineering is documenting what the code characteristics are (among other things). – Eric Postpischil Sep 07 '18 at 01:05
  • What is "easy for compiler to optimize" based on? If I [try this](https://gcc.godbolt.org/z/BoUIyd) it doesn't come out all that well. Clang literally keeps the memory writes and read at the end; GCC keeps it in registers, but I'm not really impressed by what it did. – harold Sep 07 '18 at 15:42