What optimization techniques are applied to Rust code that sums up a simple arithmetic sequence?

Question

The code is naive:

use std::time;

fn main() {
    const NUM_LOOP: u64 = std::u64::MAX;
    let mut sum = 0u64;
    let now = time::Instant::now();
    for i in 0..NUM_LOOP {
        sum += i;
    }
    let d = now.elapsed();
    println!("{}", sum);
    println!("loop: {}.{:09}s", d.as_secs(), d.subsec_nanos());
}

The output is:

$ ./test.rs.out
9223372036854775809
loop: 0.000000060s
$ ./test.rs.out
9223372036854775809
loop: 0.000000052s
$ ./test.rs.out
9223372036854775809
loop: 0.000000045s
$ ./test.rs.out
9223372036854775809
loop: 0.000000041s
$ ./test.rs.out
9223372036854775809
loop: 0.000000046s
$ ./test.rs.out
9223372036854775809
loop: 0.000000047s
$ ./test.rs.out
9223372036854775809
loop: 0.000000045s

The program ends almost immediately. I also wrote equivalent code in C using a for loop, but it ran for a long time. I'm wondering what makes the Rust code so fast.

The C code:

#include <stdint.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

double time_elapse(struct timespec start) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return now.tv_sec - start.tv_sec +
           (now.tv_nsec - start.tv_nsec) / 1000000000.;
}

int main() {
    const uint64_t NUM_LOOP = 18446744073709551615u;
    uint64_t sum = 0;
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    for (int i = 0; i < NUM_LOOP; ++i) {
        sum += i;
    }

    double t = time_elapse(now);
    printf("value of sum is: %llu\n", sum);
    printf("time elapse is: %lf sec\n", t);

    return 0;
}

The Rust code is compiled with -O and the C code with -O3. The C code runs so slowly that it hasn't finished yet.

After fixing the bug found by visibleman and Sandeep, both programs print the same number in almost no time. I tried reducing NUM_LOOP by one, and the results seemed reasonable considering the overflow. Moreover, with NUM_LOOP = 1000000000, neither program overflows and both produce the correct answer in no time. What optimizations are used here? I know we can use a simple formula like (0 + NUM_LOOP - 1) * NUM_LOOP / 2 to compute the result, but I didn't think compilers would perform such a computation when overflow is involved...

Besides `-O3` it's worth adding `-Wall -Wextra` – StoryTeller - Unslander Monica Oct 24 '18 at 05:58

score 9 · Accepted Answer · edited Oct 24 '18 at 14:00

Your Rust code (without the prints and timing) compiles down to (On Godbolt):

movabs rax, -9223372036854775807
ret

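LLVM just const-folds the whole function and calculates the final value for you. (Read as an unsigned 64-bit value, -9223372036854775807 is 9223372036854775809, the number your program prints.)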

Let's make the upper limit dynamic (non constant) to avoid this aggressive constant folding:

pub fn foo(num: u64) -> u64 {
    let mut sum = 0u64;
    for i in 0..num {
        sum += i;
    }

    sum
}

This results in (Godbolt):

  test rdi, rdi            ; if num == 0
  je .LBB0_1               ; jump to .LBB0_1
  lea rax, [rdi - 1]       ; sum = num - 1
  lea rcx, [rdi - 2]       ; rcx = num - 2
  mul rcx                  ; sum = sum * rcx
  shld rdx, rax, 63        ; rdx = sum / 2
  lea rax, [rdx + rdi]     ; sum = rdx + num
  add rax, -1              ; sum -= 1
  ret
.LBB0_1:
  xor eax, eax             ; sum = 0
  ret

As you can see, the optimizer understood that you are summing all numbers from 0 to num (exclusive) and replaced your loop with a closed-form, constant-time formula: ((num - 1) * (num - 2)) / 2 + num - 1. This is algebraically equal to the familiar num * (num - 1) / 2. As for the first example with the constant limit: the optimizer probably first turned the loop into this formula and then constant-folded it.
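To see that this formula also reproduces the overflowing results reported in the question, here is a minimal sketch that mirrors the assembly in Rust. The function names sum_loop and sum_closed_form are made up for illustration, and using u128 for the intermediate product is my way of imitating the 128-bit result that mul leaves in rdx:rax; it is not a claim about how LLVM represents the computation internally.

```rust
/// Reference version: the loop from the question, with explicit wrap-around.
fn sum_loop(num: u64) -> u64 {
    let mut sum = 0u64;
    for i in 0..num {
        sum = sum.wrapping_add(i);
    }
    sum
}

/// Closed form mirroring the assembly: ((num - 1) * (num - 2)) / 2 + num - 1.
fn sum_closed_form(num: u64) -> u64 {
    if num == 0 {
        return 0; // corresponds to the `xor eax, eax` branch
    }
    let a = (num - 1) as u128;
    // `lea rcx, [rdi - 2]` wraps for num == 1; the product is 0 in that case anyway.
    let b = num.wrapping_sub(2) as u128;
    // Full 128-bit product, halved, then truncated to the low 64 bits (like `shld`).
    let half = ((a * b) >> 1) as u64;
    half.wrapping_add(num).wrapping_sub(1)
}

fn main() {
    // Small inputs: compare against the actual loop.
    for n in 0..1000u64 {
        assert_eq!(sum_closed_form(n), sum_loop(n));
    }
    // Inputs taken from the question and its comments.
    println!("{}", sum_closed_form(1_000_000_000)); // 499999999500000000
    println!("{}", sum_closed_form(u64::MAX)); // 9223372036854775809
    println!("{}", sum_closed_form(u64::MAX - 1)); // 9223372036854775811
}
```

Doing the multiplication in 128 bits before halving is what keeps the result correct modulo 2^64: dividing an already truncated 64-bit product by 2 would lose the top bit.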

Additional notes

The two other answers already point out your bug in the C program. When fixed, clang generates exactly the same assembly (unsurprisingly). However, GCC doesn't seem to know about this optimization and generates pretty much the assembly you would expect (a loop).
In Rust, an easier and more idiomatic way to write your code would be (0..num).sum(). Despite using more layers of abstraction (namely, iterators), the compiler generates exactly the same code as above.
To print a Duration in Rust, you can use the {:?} format specifier. println!("{:.2?}", d); prints the duration in the most fitting unit with a precision of 2. That's a fine way to print the time for almost all kinds of benchmarks.
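The last two notes can be combined into a short sketch. The function name sum_iter and the input 1_000_000 are chosen for illustration only. std::hint::black_box is a best-effort hint to keep the optimizer from treating the limit as a known constant; it is used here as an assumption about how one might avoid the constant folding shown earlier, not something the original program did.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Iterator-based version of the loop from the question.
fn sum_iter(num: u64) -> u64 {
    (0..num).sum()
}

fn main() {
    // Hide the limit from the optimizer (best effort) so the call is not folded away.
    let num = black_box(1_000_000u64);

    let now = Instant::now();
    let sum = sum_iter(num);
    let d = now.elapsed();

    println!("{}", sum);
    // Debug formatting of Duration picks a fitting unit; .2 sets the precision.
    println!("loop: {:.2?}", d);
}
```

Note that sum() follows the crate's overflow-check setting, so an overflowing input would panic in a debug build; the input here is small enough to avoid that.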

score 7 · Answer 2 · edited Oct 24 '18 at 14:37


Since an int can never be as big as your NUM_LOOP, the program will loop eternally.

const uint64_t NUM_LOOP = 18446744073709551615u;

for (int i = 0; i < NUM_LOOP; ++i) { // Change this to a uint64_t

If you fix the int bug, the compiler will optimize away these loops in both cases.
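The range argument can be sketched in Rust as follows. This assumes a 32-bit int, as on most current platforms, modelled by i32; the helper name loop_can_terminate is hypothetical. It only checks whether a non-overflowing counter could ever make i < limit false, and it does not model C's undefined behavior on signed overflow.

```rust
/// Can a counter that starts at 0 and never exceeds i32::MAX
/// ever make the condition `i < limit` false?
fn loop_can_terminate(limit: u64) -> bool {
    // The comparison in C widens the int to uint64_t; `as u64` does the same here.
    limit <= i32::MAX as u64
}

fn main() {
    // The limit from the question is out of reach for a 32-bit counter.
    println!("{}", loop_can_terminate(u64::MAX)); // false
    // A limit of 1000000000 is reachable, so that variant terminates.
    println!("{}", loop_can_terminate(1_000_000_000)); // true
    // Safe Rust reports the counter overflow instead of leaving it undefined.
    println!("{:?}", i32::MAX.checked_add(1)); // None
}
```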

edited Oct 24 '18 at 14:37

Shepmaster


answered Oct 24 '18 at 05:57

visibleman


Sorry, forgot that part. Really appreciate your help. I have another question: if the compiler optimized away the loops, where do the programs get the numbers? Both programs are printing the same number, and I tried to reduce NUM_LOOP by one and the result is 9223372036854775811 from both programs. Considering the overflow, it makes sense. If the loops are optimized out, how can we get the numbers? I also tried NUM_LOOP=1000000000, which will not produce an overflow, and results from both programs are 499999999500000000 in almost no time. How can the programs do that? – Sanhu Li Oct 24 '18 at 09:26

Actually, I was planning to update my answer. The loop can be expressed arithmetically in constant time, and compilers are smart enough to do this. However, when I tested earlier, the results were not clear-cut on whether this optimisation takes place; it depends on the compiler version and the value of the NUM_LOOP constant. – visibleman Oct 24 '18 at 10:11

score 5 · Answer 3 · answered Oct 24 '18 at 05:57


Your code is stuck in an infinite loop.

The comparison i < NUM_LOOP is always true, since int i overflows (which is undefined behavior for a signed integer in C) long before it could reach NUM_LOOP.

answered Oct 24 '18 at 05:57

Sandy

