3

I'm trying to manipulate a special struct and I need some sort of a swizzle operator. For this it makes sense to have an overloaded array [] operator, but I don't want to have any branching since the particular specification of the struct allows for a theoretical workaround.

Currently, the struct looks like this:

struct f32x4
{
    float fLow[2];
    float fHigh[2];

    f32x4(float a, float b, float c, float d)
    {
        fLow[0] = a; 
        fLow[1] = b;
        fHigh[0] = c;
        fHigh[1] = d;
    }

    // template with an int here?
    inline float& operator[] (int x) {
        if (x < 2)
            return fLow[x];
        else
            return fHigh[x - 2];
    }
};

What should I do to avoid the branch? My idea is to use a template with an integer parameter and define specializations, but I'm not sure whether that makes sense, or what the syntax of that monster would look like.

Under no circumstances can I merge the two into a single float[4] array (and no union tricks either). The reason is that each float[2] actually stands in for a platform-specific PowerPC paired single. A normal Windows compiler won't work with paired singles, which is why I replaced them with float[2]s in this code.

Using the GreenHills compiler I get this assembly output (which suggests branching does occur):

.LDW31:
00000050 80040000           89      lwz r0, 0(r4)
00000054 2c000000           90      cmpwi   r0, 0
00000058 41820000           91      beq .L69
                            92  #line32
                            93  
                            94  .LDWlin1:
0000005c 2c000001           95      cmpwi   r0, 1
00000060 40820000           96      bne .L74
                            97  #line32
                            98  
                            99  .LDWlin2:
00000064 38630004          100      addi    r3, r3, 4
00000068 38210018          101      addi    sp, sp, 24
0000006c 4e800020          102      blr
                           103  .L74:
00000070 2c000002          104      cmpwi   r0, 2
00000074 40820000          105      bne .L77
                           106  #line33
                           107  
                           108  .LDWlin3:
00000078 38630008          109      addi    r3, r3, 8
0000007c 38210018          110      addi    sp, sp, 24
00000080 4e800020          111      blr
                           112  .L77:
00000084 2c000003          113      cmpwi   r0, 3
00000088 40820000          114      bne .L80
                           115  #line34
                           116  
                           117  .LDWlin4:
0000008c 3863000c          118      addi    r3, r3, 12
00000090 38210018          119      addi    sp, sp, 24
00000094 4e800020          120      blr
                           121  .L80:
00000098 38610008          122      addi    r3, sp, 8
                           123  .L69:
                           124  #       .ef

The corresponding C++ code to that snippet should be this one:

 inline const float& operator[](const unsigned& idx) const
        {
            if (idx == 0)  return xy[0];
            if (idx == 1)  return xy[1];
            if (idx == 2)  return zw[0];
            if (idx == 3)  return zw[1];
            return 0.f;  // out-of-range fallback; note that this binds the returned reference to a temporary
        }
teodron
  • Can you elaborate on 'but I don't want to have any branching since the particular specification of the struct allows for a theoretical workaround'? – piokuc Dec 20 '12 at 16:52
  • @MarkB oops, yes, fixed the blunder. Of course, there's no assert(x < 4) in there either for brevity reasons. – teodron Dec 20 '12 at 16:53
  • @piokuc i.e. do it at compile time - since there are only 4 possible values for x that work with an instance of that class. – teodron Dec 20 '12 at 16:53
  • 1
    This question feels really localized. – andre Dec 20 '12 at 17:18
  • Why do you want to treat something that is explicitly not a 4 element array as a 4 element array? – Mark B Dec 20 '12 at 17:32
  • 1
    @ahenderson - the set-up is localized, but it seems like a reasonable question on optimization technique to me. – Useless Dec 20 '12 at 17:37

5 Answers

6

Either the index x is a runtime variable, or a compile-time constant.

  • if it is a compile-time constant, there's a good chance the optimizer will be able to prune the dead branch when inlining operator[] anyway.

  • if it is a runtime variable, like

    for (int i=0; i<4; ++i) { dosomething(f[i]); }
    

    you need the branch anyway. Unless, of course, your optimizer unrolls the loop, in which case it can replace the variable with four constants, inline & prune as above.

Did you profile this to show there's a real problem, and compile it to show the branch really happens where it could be avoided?


Example code:

float foo(f32x4 &f)
{
    return f[0]+f[1]+f[2]+f[3];
}

Example output from g++ -O3 -S:

.globl _Z3fooR5f32x4
        .type       _Z3fooR5f32x4, @function
_Z3fooR5f32x4:
.LFB4:
        .cfi_startproc
        movss       (%rdi), %xmm0
        addss       4(%rdi), %xmm0
        addss       8(%rdi), %xmm0
        addss       12(%rdi), %xmm0
        ret
        .cfi_endproc
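
To illustrate the compile-time-constant case without reading assembly, here is a minimal sketch. It assumes a C++11 (or later) compiler, which the question's toolchain may well not be, and `f32x4c` and `kExample` are hypothetical names introduced only for this example. If the selection can be evaluated inside a `static_assert`, the compiler is able to resolve the branch for constant indices; whether a given optimizer also does so for ordinary inlined calls still has to be checked in its output.

```cpp
// Sketch only: f32x4c is a hypothetical constexpr twin of the question's f32x4.
struct f32x4c
{
    float fLow[2];
    float fHigh[2];

    constexpr f32x4c(float a, float b, float c, float d)
        : fLow{a, b}, fHigh{c, d} {}

    // Same selection logic as the original operator[], written as a single
    // return statement so that it is a valid C++11 constexpr function.
    constexpr float operator[](int x) const
    {
        return x < 2 ? fLow[x] : fHigh[x - 2];
    }
};

constexpr f32x4c kExample(1.f, 2.f, 3.f, 4.f);

// Each index is a constant, so the selection is resolved during compilation.
static_assert(kExample[0] == 1.f, "index 0 reads fLow[0]");
static_assert(kExample[3] == 4.f, "index 3 reads fHigh[1]");
```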
Useless
  • I didn't profile it, but the assembly output seems to be branching for some odd reason (only template int params are used as input for the [] operator, that should count as a good candidate for a compiler optimization).. I'll see what I can do with it in the end. – teodron Dec 20 '12 at 17:16
  • 1
    I just checked, and `-O3` gets me complete inlining and constant folding [with gcc 4.5.1 on x86]. What does your call site look like? – Useless Dec 20 '12 at 17:22
  • I updated the code. The assembly output is what I get after invoking a simple printf: `printf("%f %f %f %f", v0[0], v0[1], v0[2], v0[3]);` The compiler's strategy is to optimize for speed with maximum inlining. If I'm not terribly mistaken, there are still branches that could cost, right? – teodron Dec 21 '12 at 09:10
  • I see the branching in your compiler output - but I can't tell if that's an out-of-line instantiation ... could you show the call site too, and the optimization level? – Useless Dec 21 '12 at 11:16
  • I don't know if I can show too much, I'm under a certain NDA with that stuff and I'm afraid that even the assembly could be an issue for now... but what I can tell you is that the optimization is indeed set for speed, maximum inlining and even without any debug symbols, but it won't work as g++'s optimizer. – teodron Dec 21 '12 at 11:31
  • 1
    OK, no problem. If your optimizer can't manage the constant folding, I'd agree Luc's answer is probably the best choice. – Useless Dec 21 '12 at 12:29
4

Seriously, don't do this!! Just combine the arrays. But since you asked the question, here's an answer:

#include <iostream>

float fLow [2] = {1.0,2.0};
float fHigh [2] = {50.0,51.0};

float * fArrays[2] = {fLow, fHigh};

float getFloat (int i)
{
    return fArrays[i>=2][i%2];
}

int main()
{
    for (int i = 0; i < 4; ++i)
        std::cout << getFloat(i) << '\n';
    return 0;
}

Output:

1
2
50
51
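
The same lookup can be folded back into the struct from the question. This is only a sketch: `f32x4_lut` is a hypothetical name, and the two-entry pointer table is rebuilt on every call, so whether it beats a well-predicted branch would need measuring on the target.

```cpp
#include <cassert>

// Sketch only: f32x4_lut is a hypothetical variant of the question's struct.
struct f32x4_lut
{
    float fLow[2];
    float fHigh[2];

    f32x4_lut(float a, float b, float c, float d)
    {
        fLow[0] = a;  fLow[1] = b;
        fHigh[0] = c; fHigh[1] = d;
    }

    float& operator[](unsigned x)
    {
        assert(x < 4);                       // debug-only check, removed by NDEBUG
        float* const halves[2] = { fLow, fHigh };
        return halves[x >> 1][x & 1];        // bit 1 picks the half, bit 0 the lane
    }
};
```

The index is split with a shift and a mask instead of a comparison, so no conditional is needed to select the array.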
BoBTFish
  • 2
    I'm not sure that replacing a branch with an indirection is exactly what OP needs (assuming the motivation is speed) – Useless Dec 20 '12 at 18:42
3

Since you said in a comment that your index is always a template parameter, you can indeed do the branching at compile time instead of at runtime. Here is a possible solution using std::enable_if:

#include <iostream>
#include <type_traits>

struct f32x4
{
    float fLow[2];
    float fHigh[2];

    f32x4(float a, float b, float c, float d)
    {
        fLow[0] = a; 
        fLow[1] = b;
        fHigh[0] = c;
        fHigh[1] = d;
    }

    template <int x>
    float& get(typename std::enable_if<(x >= 0 && x < 2)>::type* = 0)
    {
        return fLow[x];
    }

    template <int x>
    float& get(typename std::enable_if<(x >= 2 && x < 4)>::type* = 0)
    {
        return fHigh[x-2];
    }
};

int main()
{
    f32x4 f(0.f, 1.f, 2.f, 3.f);

    std::cout << f.get<0>() << " " << f.get<1>() << " "
              << f.get<2>() << " " << f.get<3>(); // prints 0 1 2 3
}

Regarding performance, I don't think there will be any difference since the optimizer should be able to easily propagate the constants and remove dead code subsequently, thereby removing the branch altogether. However, with this approach, you get the benefit that any attempts to invoke the function with an invalid index will result in a compiler error.
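
For readers on a newer toolchain, the same idea can be sketched with a single function. This assumes a C++17 compiler (not the one from the question), and `f32x4_cx` is a hypothetical name: `static_assert` rejects invalid indices, and `if constexpr` discards the branch that is not taken.

```cpp
// Sketch only: assumes C++17. f32x4_cx is a hypothetical name.
struct f32x4_cx
{
    float fLow[2];
    float fHigh[2];

    f32x4_cx(float a, float b, float c, float d)
        : fLow{a, b}, fHigh{c, d} {}

    template <int x>
    float& get()
    {
        static_assert(x >= 0 && x < 4, "index must be in [0, 4)");
        if constexpr (x < 2)
            return fLow[x];        // only this statement is instantiated for x = 0, 1
        else
            return fHigh[x - 2];   // only this statement is instantiated for x = 2, 3
    }
};
```

With this version, `f.get<4>()` should fail to compile with the message from the `static_assert` rather than with an overload-resolution error.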

Luc Touraille
  • As far as my knowledge goes, I could tag this as a real gem and give this answer at least a +5. Although it works marvelously on x86 platforms with pretty decent compilers (g++, VS cl), it won't work with my particular compiler/platform (it seems not to support type traits). Nevertheless, all the other answers provide essential tips as well; in the end, I'll accept this one due to its dead-on content. Thanks a lot. – teodron Dec 21 '12 at 09:23
1

Create one array (or vector) holding all 4 elements, with the fLow values in the first two positions and the fHigh values in the last two. Then just index into it.

inline float& operator[] (int x) {
    return newFancyArray[x]; //But do some bounds checking above.
}
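
Spelled out as a complete type, that could look like the sketch below. `f32x4_flat` and its member `data` are hypothetical names, and note that the question explicitly rules out a merged float[4] on the target platform, so this only applies where that constraint does not hold.

```cpp
#include <cassert>

// Sketch only: f32x4_flat and `data` are hypothetical names.
struct f32x4_flat
{
    float data[4];   // low pair in [0], [1]; high pair in [2], [3]

    f32x4_flat(float a, float b, float c, float d)
    {
        data[0] = a; data[1] = b; data[2] = c; data[3] = d;
    }

    float& operator[](int x)
    {
        assert(x >= 0 && x < 4);   // bounds check, compiled out with NDEBUG
        return data[x];
    }
};
```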
Ryan Guthrie
0

Based on Luc Touraille's answer, but without type traits (my compiler does not support them), I found that the following achieves the purpose of the question. Since operator[] cannot take its index as a template argument and still be called with the usual f[i] syntax, I introduced an at method instead. This is the result:

struct f32x4
{
    float fLow[2];
    float fHigh[2];

    f32x4(float a, float b, float c, float d)
    {
        fLow[0] = a; 
        fLow[1] = b;
        fHigh[0] = c;
        fHigh[1] = d;
    }


    template <unsigned T>
    const float& at() const;

};
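// Explicit specializations are not implicitly inline; mark them inline
// if these definitions live in a header included from several files.
template<>
inline const float& f32x4::at<0>() const { return fLow[0]; }
template<>
inline const float& f32x4::at<1>() const { return fLow[1]; }
template<>
inline const float& f32x4::at<2>() const { return fHigh[0]; }
template<>
inline const float& f32x4::at<3>() const { return fHigh[1]; }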
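
Since the original goal was a swizzle, here is a sketch of how one could be built on top of at. The struct is repeated so that the block compiles on its own; `swizzle` is a hypothetical helper, not something I have verified on the target compiler.

```cpp
// Sketch only: swizzle() is a hypothetical helper built on the at<N>() accessors.
struct f32x4
{
    float fLow[2];
    float fHigh[2];

    f32x4(float a, float b, float c, float d)
    {
        fLow[0] = a;  fLow[1] = b;
        fHigh[0] = c; fHigh[1] = d;
    }

    template <unsigned I>
    const float& at() const;   // declared only; valid indices are specialized below
};

template <> inline const float& f32x4::at<0>() const { return fLow[0]; }
template <> inline const float& f32x4::at<1>() const { return fLow[1]; }
template <> inline const float& f32x4::at<2>() const { return fHigh[0]; }
template <> inline const float& f32x4::at<3>() const { return fHigh[1]; }

// Builds a new vector whose lanes are picked by four compile-time indices.
template <unsigned A, unsigned B, unsigned C, unsigned D>
f32x4 swizzle(const f32x4& v)
{
    return f32x4(v.at<A>(), v.at<B>(), v.at<C>(), v.at<D>());
}
```

For example, `swizzle<3, 2, 1, 0>(v)` reverses the lanes of `v`. An out-of-range index such as `at<4>()` has no definition, so it should be rejected at link time rather than at compile time.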
teodron