
I've been researching and benchmarking various Fibonacci algorithms recently for my own amusement and more or less by accident came up with an alternate implementation of the classic O(n) time and O(1) space dynamic programming implementation.

Consider the following two functions:

BigInt fib_dp_classic(int n) {
  if (n == 0) {
    return 0;
  }

  BigInt x = 0, y = 1, z;
  for (int i = 2; i <= n; ++i) {
    z = x + y;
    x = y;
    y = z;
  }

  return y;
}

and

BigInt fib_dp_mod(int n) {
  BigInt x = 0, y = 1, z = 1;
  for (int i = 0; i < n; ++i) {
    switch (i % 3) {
      case 0:
        y = x + z;
        break;
      case 1:
        z = x + y;
        break;
      case 2:
        x = y + z;
        break;
    }
  }

  switch (n % 3) {
    case 0:
      return x;
    case 1:
      return y;
    default:
      return z;
  }
}

On my machine, calculating the millionth Fibonacci number takes 6.55 s with fib_dp_classic and 2.83 s with fib_dp_mod, and even turning on -O3 doesn't change this much. I don't really have any good ideas as to why the mod version is faster. Is it because the extra store instructions in the classic version are more expensive than the mod in the second? It's my understanding that the compiler should be putting all three variables in registers in both versions, and that computing the mod is actually fairly expensive; is this not the case?

In fact, I just put both of these through compiler explorer and both are using only registers once you turn optimizations on. Granted, this is only using ints, not the GMP-based bigints I was actually using for my benchmark. Is there some weird GMP implementation detail that might be the cause here?

Update: I even strace'd both to see if malloc() might be the culprit and fib_dp_classic uses 130 syscalls (for n=1000000) while fib_dp_mod uses 133. So still no real clues...

Update 2: Yes, the buffer copies are the culprit (as geza pointed out) and I was dumb for not realizing. Here are two alternate versions and their benchmark results:

BigInt fib_dp_move(int n) {
  if (n == 0) {
    return 0;
  }

  BigInt x = 0, y = 1, z;
  for (int i = 2; i <= n; ++i) {
    z = std::move(x) + y;
    x = std::move(y);
    y = std::move(z);
  }

  return y;
}

This runs in 2.84 seconds, so just about equivalent to the mod version since it eliminates the unnecessary copies.

BigInt fib_dp_swap(int n) {
  if (n == 0) {
    return 0;
  }

  BigInt x = 0, y = 1, z;
  for (int i = 2; i <= n; ++i) {
    z = x + y;
    swap(x, y);
    swap(y, z);
  }

  return y;
}

This (from geza) also runs in 2.84 seconds, so again about equivalent to the mod version since it eliminates the copies in basically the same way, just calling swap() instead of using move semantics.

  • Compiler Explorer doesn't support GMP yet, so for now it can't show the native code of your GMP version. Apparently, two loads of GMP's BigInt are what the classic version pays over the mod version. Perhaps you could configure your own CE instance to look inside the native code of those BigInt loads. – sandthorn Jul 28 '18 at 03:05
  • I wonder: if you implemented both versions with `long int`, calculated only within its range, and repeated the calculation many times (to compensate for how quickly it runs), which one would be faster? – a concerned citizen Jul 28 '18 at 05:44
  • My advice would be to learn how to use a profiler. It may take a bit of time the first time, but it is a very useful skill and can easily answer such questions. – Marc Glisse Jul 28 '18 at 05:48
  • I think you can be even faster than that by moving the first two iterations out of the for loop (they're no-ops, btw), then incrementing `i` by 3 each iteration and doing the computation for `x`, `y` and `z` (in that order) in a single iteration. This avoids the modulo 3 and the switch in the inner loop; I think it can make a difference (and I'd be interested to know the results!). – Olivier Sohn Jul 28 '18 at 07:40
  • @OlivierSohn: it won't make a difference for the GMP case. 99.9% of time spent in GMP, the "driver" routine doesn't matter. – geza Jul 28 '18 at 07:42
  • @geza maybe it will allow the compiler to do some more optimizations, and I don't believe the 99.9% is correct: it would mean that only 2 nanoseconds, on average, are used by the non-bigint operations of the loop. I doubt that. – Olivier Sohn Jul 28 '18 at 07:50
  • @OlivierSohn: fib(1000000) is a ~86kb number. It is **much** slower to work with than a simple modulo (which is optimized to a multiply anyway) + `switch`. Maybe 99.9% is an exaggeration, but surely it is more than 99%. So even if you remove the driver routine, the speedup will be less than 1%. – geza Jul 28 '18 at 07:54
  • @geza maybe some branch mispredictions are also penalizing the loop. – Olivier Sohn Jul 28 '18 at 08:09
  • @OlivierSohn: the bottleneck is not there. Sure, you can speed the routine up, but for Fib(1000000), the speedup will be at most 1% (presumably much less). – geza Jul 28 '18 at 08:18
  • @geza I'm trying to benchmark this, do you have any idea where I can find BigInt? doesn't seem to be in gmp headers... – Olivier Sohn Jul 28 '18 at 08:21
  • @OlivierSohn: I don't know, it's a proprietary interface to GMP perhaps. You maybe want to use the `gmpxx.h` headers in GMP, and the class `mpz_class`, it's very easy to use. – geza Jul 28 '18 at 08:23
  • @Mark Schott, can you please indicate what `BigInt` is in your question? – Olivier Sohn Jul 28 '18 at 08:49

3 Answers


Your first function uses one addition and 3 copies of a BigInt per iteration -- all of them quite time-consuming operations. The second function uses one addition and one copy per iteration -- that's where the savings come from: you save 2 BigInt copy operations each time around the loop.

lenik
  • Is it OK if I edit your answer and try to simplify it? –  Jul 28 '18 at 03:03
  • @Gox what part seems too complicated to you? – lenik Jul 28 '18 at 03:24
  • btw, what is `BigInt`? How "big" is it? (in term of size? is it dynamically allocated?) – Olivier Sohn Jul 28 '18 at 07:38
  • @OlivierSohn please, use google or open another question? – lenik Jul 28 '18 at 08:39
  • @lenik `BigInt` is used in the question, but it's not indicated what it is, so it's totally appropriate to ask here (and btw, your answer assumes an implementation of `BigInt`, so it may not be relevant depending on the actual implementation). – Olivier Sohn Jul 28 '18 at 08:41

In this case, there is no point comparing the GMP version with the simple int version. Fib(1000000) is a ~86 KB number, which is much slower to work with than a simple int. For ints, a copy can be a basically free operation in certain circumstances, while for GMP numbers it involves copying a ~86 KB buffer.

(Note, of course, that the copy won't always be 86 KB. In the beginning it is ~0 KB, but as the routine progresses it grows to 86 KB. And as the numbers grow, the routine becomes slower and slower, so the majority of the time is spent when the numbers are big.)

Assuming a quality BigInt implementation, we have these operation counts in each iteration:

  • fib_dp_classic: 1 add, 2 copies
  • fib_dp_mod: 1 add

As you can see, the classic version does 2 extra copies (note that in a quality implementation, x=y+z doesn't involve a copy). And the speed of a copy has the same order of magnitude as the add (an add may be 2x-3x slower than a copy). So this explains the ~2.3x slowdown of the classic routine.

Note that if BigInt were an implementation which uses reference counting, then the x=y operation could be basically free, because it doesn't need a copy, just an increment of a counter (in that case, the classic routine would have the same speed as the mod one).

Last note: presumably you can speed up the classic version, if a non-copying swap operation is available:

BigInt fib_dp_swap(int n) {
  if (n == 0) {
    return 0;
  }

  BigInt x = 0, y = 1, z;
  for (int i = 2; i <= n; ++i) {
    z = x + y;
    x.swap(y); // Note the 2 swaps here
    y.swap(z);
  }

  return y;
}

With gmpxx, this routine runs in the same time as the mod version, and it is much simpler.

This is because a swap operation can be much cheaper than a copy: swap just exchanges the pointers inside BigInt, while a copy needs a ~86 KB memory copy.

geza
  • In GMP, is there no way to construct the BigInt object "in place" without a deallocation/allocation every time we do `z = x + y;`? – sandthorn Jul 28 '18 at 10:57
  • @sandthorn: I don't think that GMP will reallocate every time. But the reallocation strategy is suboptimal, because the allocated space grows linearly instead of exponentially. Presumably the routine could be faster if `x`, `y` and `z` had the necessary space preallocated. But I don't know how to achieve this cleanly. – geza Jul 28 '18 at 11:13
  • Thanks for the detailed answer, I feel dumb for not realizing how big these buffers are and how different this is from the trivial `int` case. Updating my answer to include a few other optimizations, including your swap() idea. – Mark Schott Jul 30 '18 at 20:51

It's a case of data dependency: in your classic version, each iteration takes two cycles, as

z = x + y;
x = y; // depends on y from previous loop.
y = z; // depends on z

whereas in the mod case each iteration depends only on the previous one, so it can do one per cycle.

Also, the %3 is not really a modulo-3 operation but some compiler magic that turns it into a simple operation. In fact, you can probably optimize it more by writing

  for (int i = 0; i < n; i += 3) {
    y = x + z;
    z = x + y;
    x = y + z;
  }

The cost of the extra calculations is offset by the savings in loop-control logic.

Surt