
Here is my Python code:

len_sums = 0
for i in xrange(100000):
    set_1 = set(xrange(1000))
    set_2 = set(xrange(500, 1500))
    intersection_len = len(set_1.intersection(set_2))
    len_sums += intersection_len
print len_sums

Here is my Rust code:

use std::collections::HashSet;

fn main() {
    let mut len_sums = 0;
    for _ in 0..100000 {
        let set_1: HashSet<i32> = (0..1000).collect();
        let set_2: HashSet<i32> = (500..1500).collect();
        let intersection_len = set_1.intersection(&set_2).count();
        len_sums += intersection_len;
    }
    println!("{}", len_sums);
}

I believe these are roughly equivalent. I get the following performance results:

time python set_performance.py
50000000

real    0m11.757s
user    0m11.736s
sys 0m0.012s

and

rustc set_performance.rs -O
time ./set_performance
50000000

real    0m17.580s
user    0m17.533s
sys 0m0.032s

Building with cargo and `--release` gives the same result.

I realize that Python's set is implemented in C, and so is expected to be fast, but I did not expect it to be faster than Rust. Wouldn't it have to do extra type checking that Rust would not?

Perhaps I'm missing something in the way I compile my Rust program; are there any other optimization flags that I should be using?

Another possibility is that the code is not really equivalent and Rust is doing unnecessary extra work. Am I missing anything?

Python version:

In [3]: import sys

In [4]: sys.version
Out[4]: '2.7.6 (default, Jun 22 2015, 17:58:13) \n[GCC 4.8.2]'

Rust version:

$ rustc --version
rustc 1.5.0 (3d7cd77e4 2015-12-04)

I am using Ubuntu 14.04 and my system architecture is x86_64.

Akavall
    When I move the set-building out of the loop and only repeat the intersection, for both cases of course, Rust is faster than python2.7. So the question is slightly wrong. – bluss Feb 16 '16 at 17:51
  • @bluss good point, on my machine `rust` is only a tiny bit faster, `0m4.168s` vs `0m3.838s`. And the initialization was taking a good bit of time. Thanks again. – Akavall Feb 16 '16 at 18:01
  • @bluss *But* if I use `set1 & set2` on PyPy3 I get 1.0s vs 2.3s, so Python's back in the lead ;P – Veedrac Feb 16 '16 at 18:45

3 Answers


When I move the set-building out of the loop and only repeat the intersection, for both cases of course, Rust is faster than Python 2.7.

I've only read Python 3's implementation (setobject.c), but it has some things going for it.

It uses the fact that both Python set objects use the same hash function, so it does not recompute the hash. Rust HashSets have instance-unique keys for their hash functions, so during intersection they must rehash keys from one set with the other set's hash function.

On the other hand, Python must call out to a dynamic key comparison function like PyObject_RichCompareBool for each matching hash, while the Rust code uses generics and will specialize the hash function and comparison code for i32. The code for hashing an i32 in Rust looks relatively cheap, and much of the hashing algorithm (handling input longer than 4 bytes) is optimized away.


It appears that it's the construction of the sets that sets Python and Rust apart. In fact, not just construction: there's also significant code running to destruct the Rust HashSets. (This can be improved; bug filed: #31711.)
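A hypothetical variant of the question's benchmark along these lines (set construction hoisted out of the loop, so the loop times only the intersection) reproduces the measurement:

```rust
use std::collections::HashSet;

fn main() {
    // Build the sets once so the loop times only the intersection.
    let set_1: HashSet<i32> = (0..1000).collect();
    let set_2: HashSet<i32> = (500..1500).collect();

    let mut len_sums = 0;
    for _ in 0..100000 {
        len_sums += set_1.intersection(&set_2).count();
    }
    // 500 overlapping elements (500..1000) * 100000 iterations
    println!("{}", len_sums);
}
```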

bluss
*Rust HashSets have instance-unique keys for their hash functions, so during intersection they must rehash keys from one set with the other set's hash function.* => could this be optimized out? I am thinking of maybe having a method on the `BuildHasherDefault` or just *comparing* said builders between the two instances of the `HashSet`/`HashMap` to optimize out the hash recomputation when possible. This way, you could use the same builder or equivalent builders on the sets on which you need to perform intersection/union/... – Matthieu M. Feb 17 '16 at 09:30
Since elements need to be and are rehashed (they're fed to contains_key on the other set), why does the method require the same type of hash function on both sets, i.e. why is there only one generic type parameter `S: BuildHasher + Default`? – Stein Dec 19 '18 at 21:31

The performance problem boils down to the default hashing implementation of HashMap and HashSet. Rust's default hash algorithm is a good general-purpose one that also protects against certain types of DoS attacks. However, it doesn't work great for very small or very large amounts of data.

Some profiling showed that make_hash<i32, std::collections::hash::map::RandomState> was taking up about 41% of the total runtime. As of Rust 1.7, you can choose which hashing algorithm to use. Switching to the FNV hashing algorithm speeds up the program considerably:

extern crate fnv;

use std::collections::HashSet;
use std::hash::BuildHasherDefault;
use fnv::FnvHasher;

fn main() {
    let mut len_sums = 0;
    for _ in 0..100000 {
        let set_1: HashSet<i32, BuildHasherDefault<FnvHasher>> = (0..1000).collect();
        let set_2: HashSet<i32, BuildHasherDefault<FnvHasher>> = (500..1500).collect();
        let intersection_len = set_1.intersection(&set_2).count();
        len_sums += intersection_len;
    }
    println!("{}", len_sums);
}

On my machine, this takes 2.714s compared to Python's 9.203s.

If you make the same changes to move the set building out of the loop, the Rust code takes 0.829s compared to the Python code's 3.093s.

Shepmaster

Hashing aside, Python races past previous versions of Rust when you intersect a tiny set and a huge set the wrong way around. For example, this code on the playground:

use std::collections::HashSet;
fn main() {
    let tiny: HashSet<i32> = HashSet::new();
    let huge: HashSet<i32> = (0..1_000).collect();
    for (left, right) in &[(&tiny, &huge), (&huge, &tiny)] {
        let sys_time = std::time::SystemTime::now();
        assert_eq!(left.intersection(right).count(), 0);
        let elapsed = sys_time.elapsed().unwrap();
        println!(
            "{:9}ns starting from {:4} element set",
            elapsed.subsec_nanos(),
            left.len(),
        );
    }
}

when run with Rust 1.32 or earlier rather than a current version, reveals that you really want to invoke the intersection method on the smaller of the two sets (even in the borderline case where one set is empty). I got nice performance gains by calling this function instead of the intersection method:

fn smart_intersect<'a, T, S>(
    s1: &'a HashSet<T, S>,
    s2: &'a HashSet<T, S>,
) -> std::collections::hash_set::Intersection<'a, T, S>
where
    T: Eq + std::hash::Hash,
    S: std::hash::BuildHasher,
{
    if s1.len() < s2.len() {
        s1.intersection(s2)
    } else {
        s2.intersection(s1)
    }
}
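For example (hypothetical 10-element and 1000-element sets; the helper above is repeated so this sketch compiles on its own), the argument order no longer matters:

```rust
use std::collections::HashSet;

// Same helper as above, repeated so this sketch is self-contained.
fn smart_intersect<'a, T, S>(
    s1: &'a HashSet<T, S>,
    s2: &'a HashSet<T, S>,
) -> std::collections::hash_set::Intersection<'a, T, S>
where
    T: Eq + std::hash::Hash,
    S: std::hash::BuildHasher,
{
    if s1.len() < s2.len() {
        s1.intersection(s2)
    } else {
        s2.intersection(s1)
    }
}

fn main() {
    let tiny: HashSet<i32> = (0..10).collect();
    let huge: HashSet<i32> = (0..1_000).collect();
    // Either argument order iterates over the 10-element set internally.
    assert_eq!(smart_intersect(&tiny, &huge).count(), 10);
    assert_eq!(smart_intersect(&huge, &tiny).count(), 10);
}
```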

The method in Python treats the two sets equally (at least in version 3.7).

PS Why is this? Say small set Sa has A items, big set Sb has B items, it takes Th time to hash one key, Tl(X) time to locate a hashed key in a set with X elements. Then:

  • Sa.intersection(&Sb) costs A * (Th + Tl(B))
  • Sb.intersection(&Sa) costs B * (Th + Tl(A))

Assuming the hash function is good and the buckets plentiful (if we're worrying about the performance of intersection, we should have made sure the sets are efficient to begin with), then Tl(B) should be on par with Tl(A), or at least Tl(X) should scale much less than linearly with set size. Therefore it's A versus B that determines the cost of the operation.
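Plugging invented numbers into that model (all costs are made up for illustration: Th = 10 ns, Tl treated as a constant 20 ns) shows how lopsided the two orders get:

```rust
fn main() {
    // Invented costs, in nanoseconds: hashing a key, locating it in a table.
    let th = 10.0;
    let tl = |_size: f64| 20.0; // ~constant for a well-sized hash table
    let (a, b) = (10.0, 1_000_000.0); // |Sa| = 10, |Sb| = 1,000,000

    let small_first = a * (th + tl(b)); // Sa.intersection(&Sb)
    let big_first = b * (th + tl(a)); // Sb.intersection(&Sa)
    println!("{} ns vs {} ns", small_first, big_first);
    // prints "300 ns vs 30000000 ns"
}
```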

PPS The same problem and workaround existed for is_disjoint, and also a bit for union (it's cheaper to copy the big set and add a few elements than to copy the small set and add a lot, but not hugely). A pull request was merged, so this discrepancy has disappeared since Rust 1.35.

Stein
    You might as well see if you can submit a PR to the standard library for this. Seems like a safe enough change to make for everyone. – Shepmaster Dec 18 '18 at 21:37
  • Perhaps people would argue against a cost for those who know up front that the left set is smaller, or even worse, who know it's bigger but require the left set to be there because of slow hashing functions or something else? In any case, at least it should be documented and right now it isn't (where I looked). – Stein Dec 18 '18 at 21:59
  • On closer inspection [of the code](https://github.com/rust-lang/rust/tree/master/src/libstd/collections/hash/set.rs), it doesn't seem like too much thought went into it, so trying to provide the fix along with a bug report. – Stein Dec 19 '18 at 11:49
  • By the way, the default Set in Scala has the same performance quirk: you need to put the small set first. – Stein Feb 15 '19 at 23:05