2

I am trying to calculate log2(x) using integer-arithmetic operations.

The input x is a value between 1 and 2.

Because this will simply yield 0, everything is pre-scaled by 16.

In other words:

  • The function takes the integer value of x * 2^16 instead of x
  • The function returns the integer value of log2(x) * 2^16 instead of log2(x)

Here is my code:

uint64_t Log2(uint64_t x)
{
    static uint64_t TWO = (uint64_t)2 << 16;

    uint64_t res = 0;

    for (int i=0; i<16; i++)
    {
        x = (x * x) >> 16;
        if (x >= TWO)
        {
            x >>= 1;
            res += 1 << (15 - i);
        }
    }

    return res;
}

What I am looking for is a way to improve the performance of the loop.

halfer
  • 19,824
  • 17
  • 99
  • 186
goodvibration
  • 5,980
  • 4
  • 28
  • 61
  • Let the loop run backwards? No need for `31 - i` then. – alk Jul 29 '17 at 14:59
  • @Blastfurnace: The real `x` is any value between 1 and 2. So the input is any value between `0x100000000` and `0x200000000`. – goodvibration Jul 29 '17 at 15:04
  • @Blastfurnace: The prescaled input is an integer value representing the real input. – goodvibration Jul 29 '17 at 15:06
  • @alk: Damn it, stupid me (I used to use `i` for some other thing, now it's gone). Thanks. That will yield a minor improvement I suppose. – goodvibration Jul 29 '17 at 15:07
  • This function, as written, will always return 0 (hint: what values would `x` have to have in order for `x >= TWO` to be true, and what values can it have after `x = (x * x) >> 32`?), and thus can be optimized to `return 0;`. –  Jul 29 '17 at 15:11
  • @Fanael: Sorry, you are correct. I copy-pasted from a different code of mine, which makes use of `uint128_t` which does not exists here. I will fix this in a minute, thank you for diverting my attention to this. – goodvibration Jul 29 '17 at 15:31
  • @alk: GCC compiles both variants to the same code. –  Jul 29 '17 at 15:35
  • @Fanael: 1+ to GCC! :-) Thanks for checking, BTW. – alk Jul 29 '17 at 15:36
  • @goodvibration What's the target CPU? –  Jul 29 '17 at 15:40
  • @Fanael: 64 bits for the sake of this question. – goodvibration Jul 29 '17 at 15:45
  • What architecture? x86? –  Jul 29 '17 at 15:46
  • @Fanael: No assembly allowed. – goodvibration Jul 29 '17 at 15:52
  • That's irrelevant. I'll ask again: what architecture? –  Jul 29 '17 at 16:02
  • @Fanael: Yes, x86. – goodvibration Jul 29 '17 at 16:18
  • What is your optimization setting when you compile? – Thomas Matthews Jul 29 '17 at 16:32
  • You could use another "count down" variable to remove the "15-i" statement. Decrementing a variable may be faster than a subtraction. – Thomas Matthews Jul 29 '17 at 16:34
  • 2
    You post would be a better fit for [codereview.se]. – Thomas Matthews Jul 29 '17 at 16:35
  • 1
    So a deeply pipelined, out-of-order architecture with fast, fully pipelined multiplication. I'd recommend replacing the `if` with branchless code instead, then. I'd also look closely at trying to reduce the length of the dependency chain, but that may be hard with this algorithm — each iteration depends on the value of `x` from the previous iteration, greatly restricting the OoOE engine in utilizing instruction level parallelism. –  Jul 29 '17 at 16:35
  • @ThomasMatthews See [my previous comment](https://stackoverflow.com/questions/45390514/improve-performance-of-calculating-log2-of-a-number-between-1-and-2?noredirect=1#comment77742615_45390514). –  Jul 29 '17 at 16:36
  • I'd try a loop like `for(std::uint64_t i = 0x8000; i != 0; i >>= 1)` instead, with `res += 1 << (15 - i);` replaced simply with `res += i;`. –  Jul 29 '17 at 16:37
  • "everything is pre-scaled by 16" --> "everything is pre-scaled by `1 <<16`". – chux - Reinstate Monica Jul 29 '17 at 17:21
  • @Fanael: The architecture is not relevant, as OP clearly stated he does not want to use assembly code. – too honest for this site Jul 29 '17 at 17:28
  • Did you consider a simple lookup table? – rici Jul 29 '17 at 17:29
  • 2
    @Olaf: Do you really think that, when it comes to optimizing, AVR, x86 and a GPU are exactly the same thing? –  Jul 29 '17 at 17:32
  • @Fanael: If the code is written properly (i.e. no unnecessarily wide types, etc.), you should leave optimisations to the compiler in the first place. And yes, you can implement that algorithm platform independent so it will compile fine on all those platforms (possibly except for a GPU or other special-purpose processors, but the question is C, not OpenCL or CUDA). Nevertheless, this is off-topic here as is the question. – too honest for this site Jul 29 '17 at 17:41
  • @rici: Can't use arrays. Expensive on the current infrastructure. – goodvibration Jul 29 '17 at 17:55
  • @goodvibration: What do you mean by expensive? Why? Do you want a solution which has a loop 0->15? Or any kind of solution is OK? – geza Jul 29 '17 at 19:56
  • @goodvibration: do you have to use this iterative approach, or can you use something else entirely, as long as it gets the right answers? –  Jul 29 '17 at 21:31
  • 1
    Review of functional code is off-topic for Stack Overflow. Such questions belong on [codereview.se]. – Makyen Jul 29 '17 at 23:45

3 Answers3

2

While you said in the comments that you don't want a lookup table based solution, I still present one here. The reason is simple: this lookup table is 516 bytes. And if I compile your Log2 with -O3, I get a ~740 byte function for that, so it is in the same ballpark.

I didn't create a solution which perfectly matches yours. The reason is simple: your version is not as precise as possible. I used rint(log(in/65536.0f)/log(2)*65536) as a reference. Your version produces worst difference of 2, and average difference of 1.0. This proposed version has a worst difference of 1, and average difference of 0.2. So this version is more accurate.

About performance: I've checked two microbenchmarks:

  • use a simple LCG random generator for input. My version is 29 times faster
  • use numbers 0x10000->0x20000 linearly. My version is 17 times faster

The solution is extremely simple (use initTable() to initialize the lookup table), it linearly interpolates between table elements:

unsigned short table[0x102];
void initTable() {
    for (int i=0; i<0x102; i++) {
        int v = rint(log(i*0x100/65536.0f+1)/log(2)*65536);
        if (v>0xffff) v = 0xffff;
        table[i] = v;
    }
}

int log2(int val) {
    int idx = (val-0x10000)>>8;
    int l0 = table[idx];
    int l1 = table[idx+1];

    return l0+(((l1-l0)*(val&0xff)+128)>>8);
}

I've just played with the table, and here are further results:

  • you can decrease table size to have 0x82 elements (260 bytes), and still have worst error of 1, and average error of 0.32 (you need to put 0.5+ in rint() in this case)
  • you can decrease table size to have 0x42 elements (132 bytes), worst error becomes 2, and average error is 0.53 (you need to put 0.75+ in rint() in this case)
  • decreasing the table size further significantly increases worst error
geza
  • 28,403
  • 6
  • 61
  • 135
0

As your code is already very fast, I would try to unroll the loop. Writing the loop body 16 times renders the code unreadable, but saves the loop overhead and the expression 1 << (15 - i) becomes constant.

user5329483
  • 1,260
  • 7
  • 11
0

Trust your compiler :). Just use high enough level of optimisation and the compiler will sort this kind of micro optimalisations.

examples: gcc ARM - https://godbolt.org/g/4XdPCp

gcc - x86-64 https://godbolt.org/g/nBNmLR

gcc - AVR https://godbolt.org/g/Mq81Sg

So almost no branches, no cache flushes & misses (or at least a minimal number) - easy pipelined, optimum execution time

0___________
  • 60,014
  • 4
  • 34
  • 74