
I'm trying to learn about Rust's concurrency and parallel computing, and threw together a small script that iterates over a vector of vectors as if it were an image's pixels. At first I just wanted to see how much faster par_iter is than iter, so I threw in a basic timer -- which is probably not amazingly accurate. However, I was getting crazy high numbers, so I put together a similar piece of code in Go, which allows for easy concurrency, and the performance there is ~585% faster!

Rust was tested with --release

I also tried using a native thread pool, but the results were the same. I looked at how many threads I was using and messed around with that for a while as well, to no avail.

What am I doing wrong? (Don't mind the decidedly non-performant way of creating a random-value-filled vector of vectors.)

Rust code (~140ms)

use rand::Rng;
use std::time::Instant;
use rayon::prelude::*;

fn normalise(value: u16, min: u16, max: u16) -> f32 {
    (value - min) as f32 / (max - min) as f32
}

fn main() {
    let pixel_size = 9_000_000;
    let fake_image: Vec<Vec<u16>> = (0..pixel_size).map(|_| {
        (0..4).map(|_| {
            rand::thread_rng().gen_range(0..=u16::MAX)
        }).collect()
    }).collect();

    // Time starts now.
    let now = Instant::now();

    let chunk_size = 300_000;

    let _normalised_image: Vec<Vec<Vec<f32>>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<Vec<f32>> = chunk.iter().map(|i| {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            
            vec![r, g, b, a]
        }).collect();

        normalised_chunk
    }).collect();

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed: {:.2?}", elapsed);
}

Go code (~24ms)

package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

func normalise(value uint16, min uint16, max uint16) float32 {
    return float32(value-min) / float32(max-min)
}

func main() {
    const pixelSize = 9000000
    var fakeImage [][]uint16

    // Create a new random number generator
    src := rand.NewSource(time.Now().UnixNano())
    rng := rand.New(src)

    for i := 0; i < pixelSize; i++ {
        var pixel []uint16
        for j := 0; j < 4; j++ {
            pixel = append(pixel, uint16(rng.Intn(1<<16)))
        }
        fakeImage = append(fakeImage, pixel)
    }

    normalised_image := make([][4]float32, pixelSize)
    var wg sync.WaitGroup

    // Time starts now
    now := time.Now()
    chunkSize := 300_000
    numChunks := pixelSize / chunkSize
    if pixelSize%chunkSize != 0 {
        numChunks++
    }

    for i := 0; i < numChunks; i++ {
        wg.Add(1)

        go func(i int) {
            // Loop through the pixels in the chunk
            for j := i * chunkSize; j < (i+1)*chunkSize && j < pixelSize; j++ {
                // Normalise the pixel values
                _r := normalise(fakeImage[j][0], 0, ^uint16(0))
                _g := normalise(fakeImage[j][1], 0, ^uint16(0))
                _b := normalise(fakeImage[j][2], 0, ^uint16(0))
                _a := normalise(fakeImage[j][3], 0, ^uint16(0))

                // Set the pixel values
                normalised_image[j][0] = _r
                normalised_image[j][1] = _g
                normalised_image[j][2] = _b
                normalised_image[j][3] = _a
            }

            wg.Done()
        }(i)
    }

    wg.Wait()

    elapsed := time.Since(now)
    fmt.Println("Time taken:", elapsed)
}
l1901
  • You are doing an enormous amount of heap allocations in Rust by reconstructing `Vec`s, which are a heap-allocated type. You don't do any in Go. So the heap allocations and memory copies account for the time difference. – Siiir Jul 27 '23 at 00:41
  • Do `.par_iter_mut().for_each(...)` in Rust and modify the primitives in place. Also take care about cache locality – use multidimensional arrays over nested vectors, or create a 1-dim vec but index it as if it were 3-dim (see the sketch after these comments) – so that you get one contiguous segment of memory which your CPU can fetch in a few reads. – Siiir Jul 27 '23 at 00:47
  • @Siiir Thank you. Any advantage of a 1D array over a 1D vec in this case? – l1901 Jul 27 '23 at 00:58
  • `Vec` is just a `[T; _]`, but on the heap and with a buffer that can be shrunk (with `.shrink_to_fit()`). There's a certain size threshold, I think somewhere between 1KB and 100KB, past which an array is too large and should be replaced with a `Vec`, `Box<[T]>`, or `Box<[T; N]>`. You get cache locality when a CPU core can fetch a contiguous stretch of memory once and do many operations on it. All Rust arrays are contiguous in memory, even if N-dimensional. A `Vec<f32>` will point to a contiguous array of f32, but a `Vec<Vec<Vec<f32>>>` will not. – Siiir Jul 27 '23 at 01:13
  • Consider modifying your code so that `normalized_image: Vec<[f32; 4]>`. This should drastically speed up the Rust code and will make it a fair comparison - the Rust type `[T; n]` is the equivalent of the Go type `[n]T`. Both types are an array of `n` elements of type `T`, where `n` is known at compile time. – Mark Saving Jul 27 '23 at 01:52
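A minimal sketch of the flat, index-computed layout Siiir describes; the idx helper and the image dimensions here are made up for illustration:

fn idx(x: usize, y: usize, c: usize, width: usize, channels: usize) -> usize {
    // Row-major layout: rows of `width` pixels, each pixel holding `channels` values.
    (y * width + x) * channels + c
}

fn main() {
    let (width, height, channels) = (1920, 1080, 4);
    // One contiguous allocation standing in for a height x width x channels image.
    let image = vec![0u16; width * height * channels];
    // Read the green channel (c = 1) of the pixel at (x = 10, y = 20).
    let g = image[idx(10, 20, 1, width, channels)];
    println!("{g}");
}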

3 Answers


The most important initial change for speeding up your Rust code is using the correct type. In Go, you use a [4]float32 to represent an RGBA quadruple, while in Rust you use a Vec<f32>. The correct type to use for performance is [f32; 4], an array known to contain exactly 4 floats. An array with known size need not be heap-allocated, while a Vec is always heap-allocated. This improves your performance drastically - on my machine, it's a factor of 8 difference.
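To make the cost difference concrete, here is a small sketch comparing the in-memory footprint of the two representations (the numbers printed are for a typical 64-bit target):

use std::mem::size_of;

fn main() {
    // [f32; 4] is a plain 16-byte value: stored inline, no allocation.
    println!("[f32; 4]: {} bytes, inline", size_of::<[f32; 4]>());
    // Vec<f32> is a 24-byte (pointer, capacity, length) header on 64-bit,
    // plus a separate heap allocation holding the four floats.
    println!("Vec<f32>: {} bytes + heap buffer", size_of::<Vec<f32>>());
}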

Original snippet:

    let fake_image: Vec<Vec<u16>> = (0..pixel_size).map(|_| {
        (0..4).map(|_| {
            rand::thread_rng().gen_range(0..=u16::MAX)
        }).collect()
    }).collect();

... 

    let _normalised_image: Vec<Vec<Vec<f32>>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<Vec<f32>> = chunk.iter().map(|i| {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            
            vec![r, g, b, a]
        }).collect();

        normalised_chunk
    }).collect();

New snippet:

    let fake_image: Vec<[u16; 4]> = (0..pixel_size).map(|_| {
        let mut result: [u16; 4] = Default::default();
        result.fill_with(|| rand::thread_rng().gen_range(0..=u16::MAX));
        result
    }).collect();

...

    let _normalised_image: Vec<Vec<[f32; 4]>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<[f32; 4]> = chunk.iter().map(|i| {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            
            [r, g, b, a]
        }).collect();

        normalised_chunk
    }).collect();

On my machine, this results in a roughly 7.7x speedup, bringing Rust and Go to rough parity. The overhead of doing a heap allocation for every single quadruple slowed Rust down drastically and drowned out everything else; eliminating it puts Rust and Go on more even footing.

Second, there is a slight error in your Go code. In your Rust code, you calculate a normalized r, g, b, and a, while in your Go code, you only calculate _r, _g, and _b. I don't have Go installed on my machine, but I imagine this gives Go a slight unfair advantage over Rust, since you're doing less work.

Third, you are still not quite doing the same thing in Rust and Go. In Rust, you split the original image into chunks and, for each chunk, produce a Vec<[f32; 4]>. This means you still have a bunch of chunks sitting around in memory that you'll later have to combine into a single final image. In Go, you split the original image into chunks and, for each chunk, write the results into a common array. We can rewrite your Rust code to mimic the Go code exactly. Here is what this looks like in Rust:

let _normalized_image: Vec<[f32; 4]> = {
    let mut destination = vec![[0.0f32; 4]; pixel_size];

    fake_image
        .par_chunks(chunk_size)
        // The "zip" function allows us to iterate over a chunk of the input
        // array together with a chunk of the destination array.
        .zip(destination.par_chunks_mut(chunk_size))
        .for_each(|(i_chunk, d_chunk)| {
            // Sanity check: the chunks should be of equal length.
            assert!(i_chunk.len() == d_chunk.len());
            for (i, d) in i_chunk.iter().zip(d_chunk) {
                let r = normalise(i[0], 0, u16::MAX);
                let g = normalise(i[1], 0, u16::MAX);
                let b = normalise(i[2], 0, u16::MAX);
                let a = normalise(i[3], 0, u16::MAX);

                *d = [r, g, b, a];

                // Alternatively, we could do the following loop:
                // for j in 0..4 {
                //     d[j] = normalise(i[j], 0, u16::MAX);
                // }
            }
        });

    destination
};

Now your Rust code and your Go code are truly doing the same thing. I suspect you'll find the Rust code is slightly faster.

Finally, if you were doing this in real life, the first thing you should try would be using map as follows:

    let _normalized_image = fake_image.par_iter().map(|&[r, g, b, a]| {
        [
            normalise(r, 0, u16::MAX),
            normalise(g, 0, u16::MAX),
            normalise(b, 0, u16::MAX),
            normalise(a, 0, u16::MAX),
        ]
    }).collect::<Vec<_>>();

This is just as fast as manually chunking on my machine.

Mark Saving
    Thank you for taking the time and explaining this in so much detail. When you say "Finally, if you were doing this in real life..." are you talking about just not chunking it, or the elegance of your code? – l1901 Jul 27 '23 at 17:45
  • @l1901 The first thing you should try is the simplest method that gets the job done. In real life, you would start by doing things sequentially (with just a normal `.iter().map(…).collect()`). Then, if that’s too slow, you could try using rayon and replacing `iter` with `par_iter`. If that’s still too slow, you should carefully benchmark possible changes, like manually chunking. Simple code is much more maintainable, and delegating to rayon’s choice of how to do `par_iter` and `collect` in parallel is probably the right move. – Mark Saving Jul 27 '23 at 18:35
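For illustration, a minimal sketch of that progression, reusing the fake_image and normalise from the answer above; the only change between the two stages is iter vs par_iter:

// Stage 1: the simple sequential version you would write first.
let _normalised: Vec<[f32; 4]> = fake_image
    .iter() // Stage 2: if profiling shows this is too slow, make it .par_iter()
    .map(|&[r, g, b, a]| {
        [
            normalise(r, 0, u16::MAX),
            normalise(g, 0, u16::MAX),
            normalise(b, 0, u16::MAX),
            normalise(a, 0, u16::MAX),
        ]
    })
    .collect();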
use rand::Rng;
use std::time::Instant;
use rayon::prelude::*;

fn normalise(value: u16, min: u16, max: u16) -> f32 {
    (value - min) as f32 / (max - min) as f32
}

type PixelU16 = (u16, u16, u16, u16);
type PixelF32 = (f32, f32, f32, f32);

fn main() {
    let pixel_size = 9_000_000;
    let fake_image: Vec<PixelU16> = (0..pixel_size).map(|_| {
        let mut rng = rand::thread_rng();
        (
            rng.gen_range(0..=u16::MAX),
            rng.gen_range(0..=u16::MAX),
            rng.gen_range(0..=u16::MAX),
            rng.gen_range(0..=u16::MAX),
        )
    }).collect();

    // Time starts now.
    let now = Instant::now();

    let chunk_size = 300_000;

    let _normalised_image: Vec<Vec<PixelF32>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<PixelF32> = chunk.iter().map(|i| {
            let r = normalise(i.0, 0, u16::MAX);
            let g = normalise(i.1, 0, u16::MAX);
            let b = normalise(i.2, 0, u16::MAX);
            let a = normalise(i.3, 0, u16::MAX);

            (r, g, b, a)
        }).collect::<Vec<_>>();

        normalised_chunk
    }).collect();

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed: {:.2?}", elapsed);
}

I switched from arrays to tuples, and this solution is already 10 times faster on my machine than the solution you provided. Speed could maybe be increased even further by cutting the Vec and using an Arc<Mutex<Vec<PixelF32>>> or an mpsc channel to reduce the amount of heap allocations.
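As a sketch of one way to cut those per-chunk Vecs: rather than an Arc<Mutex<..>>, which would make the workers contend for a single lock, the output can be preallocated once and handed out in disjoint mutable chunks. This reuses PixelU16, PixelF32, normalise, fake_image, pixel_size, and chunk_size from the code above:

let mut normalised_image: Vec<PixelF32> = vec![(0.0, 0.0, 0.0, 0.0); pixel_size];

fake_image
    .par_chunks(chunk_size)
    // Pair each input chunk with the matching chunk of the output buffer.
    .zip(normalised_image.par_chunks_mut(chunk_size))
    .for_each(|(src, dst)| {
        for (i, d) in src.iter().zip(dst) {
            *d = (
                normalise(i.0, 0, u16::MAX),
                normalise(i.1, 0, u16::MAX),
                normalise(i.2, 0, u16::MAX),
                normalise(i.3, 0, u16::MAX),
            );
        }
    });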

AlexN

Vec<Vec<T>> is usually not recommended because it's not very cache friendly; since you have a Vec<Vec<Vec<T>>>, the situation is even worse.

The process of memory allocation also costs a lot of time.

A slight improvement is to change the type to Vec<Vec<[T; N]>>, since the innermost Vec<T> should really be a fixed-size array of 4 u16s or f32s. This reduced the processing time on my PC from ~110ms down to ~11ms.

fn rev1() {
    let pixel_size = 9_000_000;
    let chunk_size = 300_000;

    let fake_image: Vec<[u16; 4]> = (0..pixel_size)
        .map(|_| {
            core::array::from_fn(|_| rand::thread_rng().gen_range(0..=u16::MAX))
        })
        .collect();

    // Time starts now.
    let now = Instant::now();

    let _normalized_image: Vec<Vec<[f32; 4]>> = fake_image
        .par_chunks(chunk_size)
        .map(|chunk| {
            chunk
                .iter()
                .map(|rgba: &[u16; 4]| rgba.map(|v| normalise(v, 0, u16::MAX)))
                .collect()
        })
        .collect();

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r1): {:.2?}", elapsed);
}

However, this still requires a lot of allocations and copies. If a new vector is not needed, in-place mutation can be even faster: ~5ms.
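rev2 and rev4 below call a normalise_f32 helper that isn't shown; a plausible definition, mirroring the question's u16 normalise:

// Assumed f32 counterpart of the question's `normalise`.
fn normalise_f32(value: f32, min: f32, max: f32) -> f32 {
    (value - min) / (max - min)
}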

pub fn rev2() {
    let pixel_size = 9_000_000;
    let chunk_size = 300_000;
    let mut fake_image: Vec<Vec<[f32; 4]>> = (0..pixel_size / chunk_size)
        .map(|_| {
            (0..chunk_size)
                .map(|_| {
                    core::array::from_fn(|_| {
                        rand::thread_rng().gen_range(0..=u16::MAX) as f32
                    })
                })
                .collect()
        })
        .collect();

    // Time starts now.
    let now = Instant::now();

    fake_image.par_iter_mut().for_each(|chunk| {
        chunk.iter_mut().for_each(|rgba: &mut [f32; 4]| {
            rgba.iter_mut().for_each(|v: &mut _| {
                *v = normalise_f32(*v, 0f32, u16::MAX as f32)
            })
        })
    });

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r2): {:.2?}", elapsed);
}

Here the outer Vec<Vec<T>> is still not ideal, though flattening it doesn't produce a significant performance improvement in this particular situation. Accessing an element in the nested structure will still be slower than in a flat array.

/// Create a new flat Vec from fake_image
pub fn rev3() {
    let pixel_size = 9_000_000;
    let _chunk_size = 300_000;

    let fake_image: Vec<[u16; 4]> = (0..pixel_size)
        .map(|_| {
            core::array::from_fn(|_| rand::thread_rng().gen_range(0..=u16::MAX))
        })
        .collect();

    // Time starts now.
    let now = Instant::now();

    let _normalized_image: Vec<[f32; 4]> = fake_image
        .par_iter()
        .map(|rgba: &[u16; 4]| rgba.map(|v| normalise(v, 0, u16::MAX)))
        .collect();

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r3): {:.2?}", elapsed);
}

/// In place mutation of a flat Vec
pub fn rev4() {
    let pixel_size = 9_000_000;
    let _chunk_size = 300_000;

    let mut fake_image: Vec<[f32; 4]> = (0..pixel_size)
        .map(|_| {
            core::array::from_fn(|_| {
                rand::thread_rng().gen_range(0..=u16::MAX) as f32
            })
        })
        .collect();

    // Time starts now.
    let now = Instant::now();

    fake_image.par_iter_mut().for_each(|rgba: &mut [f32; 4]| {
        rgba.iter_mut()
            .for_each(|v: &mut _| *v = normalise_f32(*v, 0f32, u16::MAX as f32))
    });

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r4): {:.2?}", elapsed);
}
YthanZhang
  • Sweet! So am I correct in assessing, for performance purposes: use arrays when possible, nest as little as possible, and when possible mutate rather than recreate a Vec? – l1901 Jul 27 '23 at 17:52