
I have two (equivalent?) programs, one in Go and the other in Rust. The average execution time is:

  • Go ~169ms
  • Rust ~201ms

Go

package main

import (
    "fmt"
    "time"
)

func main() {
    work := []float64{0.00, 1.00}
    start := time.Now()

    for i := 0; i < 100000000; i++ {
        work[0], work[1] = work[1], work[0]
    }

    elapsed := time.Since(start)
    fmt.Println("Execution time: ", elapsed)
}

Rust

I compiled with `--release`:

use std::time::Instant;

fn main() {
    let mut work: Vec<f64> = Vec::new();
    work.push(0.00);
    work.push(1.00);

    let now = Instant::now();

    for _x in 1..100000000 {
        work.swap(0, 1); 
    }

    let elapsed = now.elapsed();
    println!("Execution time: {:?}", elapsed);
}

Is Rust less performant than Go in this instance? Could the Rust program be written in an idiomatic way, to execute faster?

Lukas Kalbertodt
Kyle
  • This kind of microbenchmark is unlikely to yield useful data. – Adrian Oct 22 '18 at 16:35
  • I have just compiled and run both your benchmarks. I used `go build -o b1 bench.go` for the Go code and `rustc -O -o b2 bench.rs` for the Rust code. The Go benchmark takes ~180ms; the Rust benchmark takes ~3ms. – Mad Wombat Oct 22 '18 at 16:36
  • At the same time, if I omit the `-O` flag for the Rust compiler, the Rust benchmark takes about 5 seconds to complete. – Mad Wombat Oct 22 '18 at 16:37
  • Also, it seems like there is a huge difference between the nightly and stable Rust compilers. Stable results in around 150ms, still better than Go, but not by much. Nightly results in 2-4ms, two orders of magnitude better than Go and stable Rust. – Mad Wombat Oct 22 '18 at 16:44
  • It is entirely possible that the nightly Rust compiler figured out that the loop is deterministic and just skipped the whole thing in favor of the end result :) But I am too lazy to dig around in the assembly to figure it out. – Mad Wombat Oct 22 '18 at 16:48
  • @MadWombat Probably the optimizer learned some new tricks. Since `work` is never actually read from, none of the writes to it matter. – Shepmaster Oct 22 '18 at 16:48
  • I wonder if printing `work` at the end would change things. – Mad Wombat Oct 22 '18 at 16:48
  • Well, predictably, after I added a print statement for the `work` variable to both benchmarks, the results became more comparable. Rust still wins: Rust takes about 150ms and Go is 180-190ms. – Mad Wombat Oct 22 '18 at 16:51
  • @MadWombat Interestingly, when I compiled with `rustc -O -o` I saw a negligible difference between the two languages at ~170ms. It appears `--release` != fully optimised. – Kyle Oct 22 '18 at 17:08
  • So basically, all these experiments seem to tell us that the benchmarks are not very conclusive. – Mad Wombat Oct 22 '18 at 17:14
  • `i := 0; i < 100000000; i++` and `1..100000000` are not the same range (they differ by one iteration). – Shepmaster Oct 22 '18 at 18:54

2 Answers


Could the Rust program be written in an idiomatic way, to execute faster?

Yes. To create a vector with a few elements, use the `vec![]` macro:

let mut work: Vec<f64> = vec![0.0, 1.0];    

for _x in 1..100000000 {
    work.swap(0, 1); 
}

So is this code faster? Yes. Have a look at what assembly is generated:

example::main:
  mov eax, 99999999
.LBB0_1:
  add eax, -11
  jne .LBB0_1
  ret

On my PC, this runs about 30 times faster than your original code.

Why does the assembly still contain this loop that is doing nothing? Why isn't the compiler able to see that two pushes are the same as `vec![0.0, 1.0]`? Both are very good questions, and both probably point to a flaw in LLVM or the Rust compiler.

However, sadly, there isn't much useful information to gain from your micro benchmark. Benchmarking is hard, like really hard. There are so many pitfalls that even professionals fall for. In your case, the benchmark is flawed in several ways. For a start, you never observe the contents of the vector later (it is never used). That's why a good compiler can remove all code that even touches the vector (as the Rust compiler did above). So that's not good.
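One way to make such a micro benchmark meaningful is to tell the optimizer that the value might be observed. On recent compilers, `std::hint::black_box` (stable since Rust 1.66, so not available back when this question was asked) does exactly that; a minimal sketch of the loop above with the dead-code elimination defeated:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    let mut work: Vec<f64> = vec![0.0, 1.0];

    let now = Instant::now();
    for _ in 0..100_000_000u64 {
        work.swap(0, 1);
        // Pretend the vector may be observed here, so the optimizer
        // cannot delete the loop as dead code.
        black_box(&mut work);
    }
    let elapsed = now.elapsed();

    // Actually using the result afterwards also keeps it alive.
    println!("{:?} in {:?}", work, elapsed);
}
```

Printing `work` at the end (as Mad Wombat did in the comments) achieves the same thing, but `black_box` also stops the optimizer from collapsing all the swaps into one before the loop even runs.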

Apart from that, this does not resemble any real performance-critical code. Even if the vector were observed later, swapping an odd number of times is equivalent to a single swap. So unless you wanted to see whether the optimizer understands this swapping rule, your benchmark sadly isn't very useful.
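The parity argument is easy to check directly: n swaps of the same two slots reduce to n % 2 swaps. A small sketch (the bound matches the `1..100000000` range, which runs 99,999,999 times, an odd count):

```rust
fn main() {
    let mut work = vec![0.0f64, 1.0];
    let n: u64 = 99_999_999; // iterations of `1..100000000`, an odd count

    // n swaps of the same pair are equivalent to n % 2 swaps.
    if n % 2 == 1 {
        work.swap(0, 1);
    }

    // Odd count: the two elements end up exchanged.
    println!("{:?}", work); // [1.0, 0.0]
}
```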

Lukas Kalbertodt
  • Appreciate the very clear and helpful answer. The original problem was implementing a bubble sort in both languages (purely for the educational value). The memory swap is where Go pulled ahead, so that's all I included in this question, perhaps to the detriment of context/clarity. With 100,000 elements, the difference in execution time was into the seconds. – Kyle Oct 22 '18 at 17:25
  • @Kyle If you have a more complete example of a bubble sort that's slower than expected, you should open another question. – PitaJ Oct 22 '18 at 17:29
  • The most puzzling thing about the generated code is `add eax, -11`. It's completely beyond me where _that_ is coming from. If it were some even number, I would assume that the optimizer had unrolled that number of iterations and figured out that an even number of swaps is a no-op. But with 11, I don't really have an idea. – Sven Marnach Oct 22 '18 at 19:15
  • @SvenMarnach That confused me too! I guess you are pretty close already with partial loop unrolling and LLVM optimizing 11 iterations into one swap. But why start at 11 and not notice that, like, 8 iterations compile down to nothing? Maybe we should ask a question on this StackOverflow everyone is talking about? ^_^ – Lukas Kalbertodt Oct 22 '18 at 19:32
  • Another possibility in this example, when you know that your `Vec` doesn't need to grow, is to use a fixed-size array. Over 1M iterations, using criterion on my ancient machine: original at 5.0338ms, with the macro 4.5929ms, using an array 3.0950ns. – Kellen Oct 24 '18 at 13:45

(Not an answer) but to augment what Lukas wrote, here's what Go 1.11 generates for the loop itself:

    xorl    CX, CX
    movsd   8(AX), X0
    movsd   (AX), X1
    movsd   X0, (AX)
    movsd   X1, 8(AX)
    incq    CX
    cmpq    CX, $100000000
    jlt     68

(Courtesy of https://godbolt.org)

In either case, note that most probably the time you measured was dominated by the startup and initialization of the process, so you did not actually measure the speed of the loops themselves. In other words, your approach is not correct.

kostix
  • 2
    *the time you measured was dominated by the startup and initialization of the processes* — the code OP presented contains the time invocations. Are you stating that both Go and Rust code are running some other code besides the loop between `time.Now` / `time.Since` or `Instant::now` / `now.elapsed`? – Shepmaster Oct 22 '18 at 18:51
  • @Shepmaster, yes, you're right. I stand corrected, thanks. – kostix Oct 23 '18 at 09:31