
I want to do the following arithmetic operations in a C preprocessor macro when I pass in the variable x.

#define calc_addr_data_reg(x) ( base_offset + ((x/7) * 0x20) + data_reg_offset)

How would I go about implementing the division and multiplication operations using bitshifts? In the division operation I only need the quotient.

Falcata
    "How would I go about implementing the division and multiplication operations using bitshifts?" <- Let the compiler take care of that. – OmnipotentEntity Mar 07 '13 at 00:05
  • Why would you want to do this with bitshifts? Did you search for relevant phrases already? – Oliver Charlesworth Mar 07 '13 at 00:05
  • If using C++, do not use the preprocessor. An inline function would be a lot better. – Ed Heal Mar 07 '13 at 00:05
  • How would I do division by bitshift? For example if I want to divide a number by 7. How would I go about doing that? – Falcata Mar 07 '13 at 00:08
  • @Falcata: Why do you **want** to? If you want to divide by 7, then write `/ 7`. – Oliver Charlesworth Mar 07 '13 at 00:10
  • Would the result be an int or a double/float? – Falcata Mar 07 '13 at 00:12
  • Also, the multiplication: given that, say, for each pin bank that I'm trying to calculate the address for I have to go forward 20 bytes, I want to multiply this by the number of times that I have to go forward and get back a value in hex. – Falcata Mar 07 '13 at 00:14
  • @Falcata, what the others are (quite rightly) trying to say, is to avoid the macro, and use an inline function. The compiler is better at writing code than you are. And that's not a bad thing. – Moo-Juice Mar 07 '13 at 00:17
  • @Falcata - By all means you can simplify mathematical equations - it is called algebra. Get a sheet of paper and a pencil. Write the equation down and use maths to make the equation have fewer operations. – Ed Heal Mar 07 '13 at 00:25
  • The reason why I'd rather use the preprocessor is that it's something that needs to go in an embedded system. An inline function would expand the binary size, whereas with a preprocessor macro the required values would be calculated at compile time. – Falcata Mar 07 '13 at 00:27
  • No, it wouldn't. Inline functions are inlined by the compiler if the compiler has all of the pieces. Even if you don't specify inline, it will still get inlined if you pass `-O2` on gcc, because that enables optimizations that might increase the size of the executable - which inlining functions does. Inlining functions increases, rather than decreases, the size of the executable. – OmnipotentEntity Mar 07 '13 at 00:29
  • Would it affect my code size and processing time vs. if I used a C preprocessor macro? – Falcata Mar 07 '13 at 00:32
  • @OmnipotentEntity - No, it does not - sometimes having a function inlined reduces the size of the executable. Take getters/setters for example - you do not need the overhead of the function call/return stuff. – Ed Heal Mar 07 '13 at 00:37

1 Answer


To answer the questions,

"Is this expression correct in the C Preprocessor?"

I don't see anything wrong with it.

"How would I go about implementing the division and multiplication operations using bitshifts? In the division operation I only need the quotient."

The compiler is going to do a better job of optimizing your code than you will in almost all cases. If you have to ask StackOverflow how to do this, then you don't know enough to outperform GCC. I know I certainly don't. But because you asked, here's how gcc optimizes it.

@EdHeal,

This needed a little bit more room to respond properly. You're absolutely correct in the example you gave (getters and setters), but in this particular example, inlining the function would slightly increase the size of the binary, assuming that it's called a few times.
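For reference, the function being compiled below is presumably something along these lines - the macro from the question written out as a plain function, with the offsets passed as parameters so everything is visible to the compiler (the parameter names and int types are my assumption):

int calc_addr_data_reg(int x, int base_offset, int data_reg_offset)
{
    return base_offset + ((x / 7) * 0x20) + data_reg_offset;
}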

GCC compiles the function to:

mov ecx, edx
mov edx, -1840700269
mov eax, edi
imul    edx
lea eax, [rdx+rdi]
sar eax, 2
sar edi, 31
sub eax, edi
sal eax, 5
add esi, eax
lea eax, [rsi+rcx]
ret

Which is more bytes than the assembly for calling the function and getting a return value back, which would presumably be 3 push instructions, a call, a return, and a pop.

With -Os it compiles into:

mov eax, edi
mov ecx, 7
mov edi, edx
cdq
idiv    ecx
sal eax, 5
add eax, esi
add eax, edi
ret

Which is fewer bytes than the call, return, pushes, and pops.

So in this case, whether the code ends up smaller or larger when inlining really depends on what compiler flags he uses.

To Op again:

Explaining what the code up there means:

The next part of this post is ripped directly from: http://porn.quiteajolt.com/2008/04/30/the-voodoo-of-gcc-part-i/


The proper reaction to this monstrosity is “wait what.” Some specific instructions that I think could use more explanation:

movl $-1840700269, -4(%ebp)

-1840700269 = -015555555555 in octal (indicated by the leading zero). I’ll be using the octal representation because it looks cooler.

imull %ecx

This multiplies %ecx and %eax. Both of these registers contain a 32-bit number, so this multiplication could possibly result in a 64-bit number. This can’t fit into one 32-bit register, so the result is split across two: the high 32 bits of the product get put into %edx, and the low 32 get put into %eax.

leal (%edx,%ecx), %eax

This adds %edx and %ecx and puts the result into %eax. lea‘s ostensible purpose is for address calculations, and it would be more clear to write this as two instructions: an add and a mov, but that would take two clock cycles to execute, whereas this takes just one.

Also note that this instruction uses the high 32 bits of the multiplication from the previous instruction (stored in %edx) and then overwrites the low 32 bits in %eax, so only the high bits from the multiplication are ever used.

sarl $2, %edx   # %edx = %edx >> 2

Technically, whether or not sar (arithmetic right shift) is equivalent to the >> operator is implementation-defined. gcc guarantees that the operator is an arithmetic shift for signed numbers (“Signed `>>’ acts on negative numbers by sign extension”), and since I’ve already used gcc once, let’s just assume I’m using it for the rest of this post (because I am).

sarl $31, %eax

%eax is a 32-bit register, so it'll be operating on integers in the range [-2^31, 2^31 - 1]. This produces something interesting: this calculation only has two possible results. If the number is greater than or equal to 0, the shift will reduce the number to 0 no matter what. If the number is less than 0, the result will be -1.
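A quick way to see this for yourself (this little test program is mine, and it relies on GCC's arithmetic-shift behaviour for signed >> mentioned above):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int32_t pos = 1234, neg = -1234;
    // An arithmetic shift copies the sign bit across all 32 bits.
    printf("%d %d\n", (int)(pos >> 31), (int)(neg >> 31)); // prints: 0 -1
    return 0;
}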

Here’s a pretty direct rewrite of this assembly back into C, with some integer-width paranoia just to be on the safe side, since a few of these steps are dependent on integers being exactly 32 bits wide:

#include <stdint.h> // for int32_t / int64_t

int32_t divideBySeven(int32_t num) {
    int32_t eax, ecx, edx, temp; // push %ebp / movl %esp, %ebp / subl $4, %esp
    ecx = num; // movl 8(%ebp), %ecx
    temp = -015555555555; // movl $-1840700269, -4(%ebp)
    eax = temp; // movl -4(%ebp), %eax

    // imull %ecx - int64_t casts to avoid overflow
    edx = ((int64_t)ecx * eax) >> 32; // high 32 bits
    eax = (int64_t)ecx * eax; // low 32 bits

    eax = edx + ecx; // leal (%edx,%ecx), %eax
    edx = eax; // movl %eax, %edx
    edx = edx >> 2; // sarl $2, %edx

    eax = ecx; // movl %ecx, %eax
    eax = eax >> 31; // sarl $31, %eax

    ecx = edx; // movl %edx, %ecx
    ecx = ecx - eax; // subl %eax, %ecx
    eax = ecx; // movl %ecx, %eax
    return eax; // leave / ret
}

Now there's clearly a whole bunch of inefficient stuff here: unnecessary local variables, a bunch of unnecessary variable swapping, and eax = (int64_t)ecx * eax; is not needed at all (I just included it for completion's sake). So let's clean that up a bit. This next listing just has most of the cruft eliminated, with the corresponding assembly above each block:

int32_t divideBySeven(int32_t num) {
    // pushl %ebp
    // movl %esp, %ebp
    // subl $4, %esp
    // movl 8(%ebp), %ecx
    // movl $-1840700269, -4(%ebp)
    // movl -4(%ebp), %eax
    int32_t eax, edx;
    eax = -015555555555;

    // imull %ecx
    edx = ((int64_t)num * eax) >> 32;

    // leal (%edx,%ecx), %eax
    // movl %eax, %edx
    // sarl $2, %edx
    edx = edx + num;
    edx = edx >> 2;

    // movl %ecx, %eax
    // sarl $31, %eax
    eax = num >> 31;

    // movl %edx, %ecx
    // subl %eax, %ecx
    // movl %ecx, %eax
    // leave
    // ret
    eax = edx - eax;
    return eax;
}

And the final version:

int32_t divideBySeven(int32_t num) {
    int32_t temp = ((int64_t)num * -015555555555) >> 32;
    temp = (temp + num) >> 2;
    return (temp - (num >> 31));
}

I still have yet to answer the obvious question, “why would they do that?” And the answer is, of course, speed. The integer division instruction used in the very first listing, idiv, takes a whopping 43 clock cycles to execute. But the divisionless method that gcc produces has quite a few more instructions, so is it really faster overall? This is why we have the benchmark.

#include <limits.h> // for INT_MIN

int main(int argc, char *argv[]) {
    int i = INT_MIN;
    do {
        divideBySeven(i);
        i++;
    } while (i != INT_MIN);

    return 0;
}

Loop over every single possible integer? Sure! I ran the test five times for both implementations and timed it with time. The user CPU times for gcc were 45.9, 45.89, 45.9, 45.99, and 46.11 seconds, while the times for my assembly using the idiv instruction were 62.34, 62.32, 62.44, 62.3, and 62.29 seconds, meaning the naive implementation ran about 36% slower on average. Yeow.

Compiler optimizations are a beautiful thing.


Ok, I'm back, now why does this work?

int32_t divideBySeven(int32_t num) {
    int32_t temp = ((int64_t)num * -015555555555) >> 32;
    temp = (temp + num) >> 2;
    return (temp - (num >> 31));
}

Let's take a look at the first part:

int32_t temp = ((int64_t)num * -015555555555) >> 32;

Why this number?

Well, let's take 2^64 and divide it by 7 and see what pops out.

2^64 / 7 = 2635249153387078802.28571428571428571429

That looks like a mess, what if we convert it into octal?

0222222222222222222222.22222222222222222222222

That's a very pretty repeating pattern - surely that can't be a coincidence. I mean, we remember that 7 is 0b111, and we know that when we divide by 9 we tend to get repeating patterns in base 10 (and indeed 0.222..._8 = 2/8 + 2/64 + 2/512 + ... = 2/7). So it makes sense that we'd get a repeating pattern in base 8 when we divide by 7.

So where does our number come in?

(int32_t)-1840700269 has the same bit pattern as (uint32_t)2454267027

2454267027 * 7 = 17179869189

And finally, 17179869184 is 2^34

Which means that 17179869189 is the smallest multiple of 7 greater than 2^34. Or to put it another way, 2454267027 (which still fits in a uint32_t) is the smallest number which, when multiplied by 7, reaches 2^34 - in other words, it is ceil(2^34 / 7).

What's this number in octal?

022222222223

Why is this important? Well, we want to divide by 7. This number is 2^34/7... approximately. So if we multiply by it, and then right-shift 34 times, we should get a number very close to the exact quotient.

The last two lines look like they were designed to patch up approximation errors.

Perhaps someone with a little more knowledge and/or expertise in this field can chime in on this.
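Here's a minimal sketch of that multiply-and-shift idea in C (the function name and the little main are mine, purely for illustration; this is the plain unsigned variant that the Python check below exercises, not the exact instruction sequence GCC emits, and as the brute-force test shows it only holds up to a certain point):

#include <stdio.h>
#include <stdint.h>

// Approximate x / 7 as (x * 2454267027) >> 34, where 2454267027 = ceil(2^34 / 7).
static uint32_t div7_approx(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 2454267027u) >> 34);
}

int main(void) {
    printf("%u %u\n", div7_approx(49u), 49u / 7u);           // 7 7
    printf("%u %u\n", div7_approx(1000000u), 1000000u / 7u); // 142857 142857
    return 0;
}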

>>> magic = 2454267027
>>> def div7(a):
...   if (int(magic * a >> 34) != a // 7):
...     return 0
...   return 1
... 
>>> for a in xrange(2**31, 2**32):
...   if (not div7(a)):
...     print "%s fails" % a
... 

Failures begin at 3435973841, which is, funnily enough, 0b11001100110011001100110011010001 - presumably because that is just past 2^34 / 5 ≈ 3435973836.8, the point where the error from using 2454267027 = ceil(2^34 / 7) rather than exactly 2^34 / 7 can first push a quotient up to the next integer.

OmnipotentEntity
  • As usual, the writers of a compiler have had more time to think things through. (Also, why worry too much about a few extra Ks in the executable when you get a good speed improvement due to being able to hit the cache better for small functions?) It is like my friend who said that he needed a new computer. I said chuck some more RAM into it and the thing will stop thrashing the HD. It worked like a treat. – Ed Heal Mar 07 '13 at 01:33
  • Totally :) But this stuff is interesting in its own right. Currently writing up an explanation of why the optimization works. – OmnipotentEntity Mar 07 '13 at 01:45
  • +1 for thoroughness… with this info he could actually construct a preprocessor macro to do the pseudo-division! (lol) – Potatoswatter Mar 07 '13 at 02:46