21

If I run these benchmarks in Rust:

#[bench]
fn bench_rnd(b: &mut Bencher) {
    let mut rng = rand::weak_rng();
    b.iter(|| rng.gen_range::<f64>(2.0, 100.0));
}

#[bench]
fn bench_ln(b: &mut Bencher) {
    let mut rng = rand::weak_rng();
    b.iter(|| rng.gen_range::<f64>(2.0, 100.0).ln());
}

The result is:

test tests::bench_ln             ... bench:        121 ns/iter (+/- 2)
test tests::bench_rnd            ... bench:          6 ns/iter (+/- 0)

121-6 = 115 ns per ln call.
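The subtraction above can also be reproduced without the nightly `#[bench]` harness. The following is only a rough sketch under some assumptions: `xorshift64`, `next_f64`, and `ns_per_iter` are hypothetical helpers written for this illustration (the xorshift generator merely stands in for `rand::weak_rng()`), and `std::hint::black_box` needs a reasonably recent stable toolchain. A hand-rolled timing loop is noisier than a real benchmark framework, so treat its numbers as indicative only.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Tiny xorshift64 generator, a hypothetical stand-in for `rand::weak_rng()`.
fn xorshift64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Map the next generator output to an f64 between `lo` and `hi`.
fn next_f64(state: &mut u64, lo: f64, hi: f64) -> f64 {
    // Keep the top 53 bits so the fraction lies in [0, 1).
    let unit = (xorshift64(state) >> 11) as f64 / (1u64 << 53) as f64;
    lo + unit * (hi - lo)
}

/// Average wall-clock nanoseconds per call of `f` over `iters` calls.
fn ns_per_iter<F: FnMut() -> f64>(iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the result.
        black_box(f());
    }
    start.elapsed().as_nanos() as f64 / iters as f64
}

fn main() {
    let iters = 1_000_000;
    let mut s1 = 0x9E37_79B9_7F4A_7C15u64;
    let mut s2 = s1;

    // Baseline: random number generation only.
    let baseline = ns_per_iter(iters, || next_f64(&mut s1, 2.0, 100.0));
    // Same work plus the `ln` call.
    let with_ln = ns_per_iter(iters, || next_f64(&mut s2, 2.0, 100.0).ln());

    println!("rng only : {:.1} ns/iter", baseline);
    println!("rng + ln : {:.1} ns/iter", with_ln);
    println!("ln alone : {:.1} ns/iter (difference)", with_ln - baseline);
}
```

Build it with `--release`, otherwise the debug-mode overhead dominates both measurements.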

But the same benchmark in Java:

@State(Scope.Benchmark)
public static class Rnd {
    final double x = ThreadLocalRandom.current().nextDouble(2, 100);
}

@Benchmark
public double testLog(Rnd rnd) {
    return Math.log(rnd.x);
}

Gives me:

Benchmark    Mode Cnt  Score  Error Units
Main.testLog avgt  20 31,555 ± 0,234 ns/op

The log is ~3.7 times slower (115/31) in Rust than in Java.

When I test the hypotenuse implementation (hypot), the implementation in Rust is 15.8 times faster than in Java.

Have I written bad benchmarks, or is it a performance issue?

Responses to questions asked in comments:

  1. "," is a decimal separator in my country.

  2. I run Rust's benchmark using cargo bench which always runs in release mode.

  3. The Java benchmark framework (JMH) creates a new object for every call, even though it's a static class and a final variable. If I generate the random number inside the tested method, I get 43 ns/op.

Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
  • 1
    Isn't Java bad to use as a benchmark basis? I mean, Java is nice, but in some cases it is too nice – Wietlol Jul 11 '17 at 14:45
  • 3
    You're probably benchmarking the random number generator more than the log function. Also, I believe that Rust just uses the system math library, so a pure call to `log` should be the same as what it is in C (no idea about Java). – Simon Byrne Jul 11 '17 at 17:14
  • 5
    Could you rerun the test using `RUSTFLAGS='-Ctarget-cpu=native' cargo bench`? – kennytm Jul 11 '17 at 17:21
  • 2
    @SimonByrne if OP were accidentally benchmarking the RNG, wouldn't that be offset by the `bench_rnd` function which *only* tests the RNG? That's why OP subtracts the two Rust benchmark timings — to make a pure benchmark of the `ln` function. I'd agree that it should just be calling the system math library though. – Shepmaster Jul 11 '17 at 18:37
  • 2
    I know nothing about Java, but are you sure that x is updated? – Stargateur Jul 12 '17 at 00:25

2 Answers

14

The answer was given by @kennytm:

export RUSTFLAGS='-Ctarget-cpu=native'

Setting this flag fixes the problem. After that, the results are:

test tests::bench_ln              ... bench:          43 ns/iter (+/- 3)
test tests::bench_rnd             ... bench:           5 ns/iter (+/- 0)

Subtracting the RNG baseline gives 43 - 5 = 38 ns per ln call. I think 38 (± 3) is close enough to Java's 31.555 (± 0.234).
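As a quick check of the arithmetic, here is a small sketch that recomputes the net `ln` cost and its ratio to the Java figure, using only the numbers reported in this thread. The helper `net_ns` is a hypothetical name introduced for this illustration.

```rust
/// Net cost of `ln`: the benchmark that includes `ln` minus the RNG-only baseline.
fn net_ns(with_ln: f64, rng_only: f64) -> f64 {
    with_ln - rng_only
}

fn main() {
    // Java (JMH) score from the question, in ns/op.
    let java = 31.555;

    // Rust with the default target: 121 ns with ln, 6 ns RNG only.
    let default_target = net_ns(121.0, 6.0);
    // Rust with -Ctarget-cpu=native: 43 ns with ln, 5 ns RNG only.
    let native_target = net_ns(43.0, 5.0);

    println!(
        "default target: {} ns per ln, {:.2}x the Java time",
        default_target,
        default_target / java
    );
    println!(
        "native target : {} ns per ln, {:.2}x the Java time",
        native_target,
        native_target / java
    );
}
```

With the flag set, the gap shrinks from roughly 3.6 times the Java time to roughly 1.2 times, which is within a few nanoseconds of the measurement error reported above.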

10

I'm going to provide the other half of the explanation, since I don't know Rust. Math.log is annotated with @HotSpotIntrinsicCandidate, meaning that HotSpot can replace the Java implementation with optimized native code for the operation, in some cases a single CPU instruction. Think of Integer.bitCount, which would either do a lot of shifting or use a direct CPU instruction that does the same thing much faster.

Having an extremely simple program like this:

public static void main(String[] args) {
    System.out.println(mathLn(20_000));
}

private static long mathLn(int x) {
    long result = 0L;
    for (int i = 0; i < x; ++i) {
        result = result + ln(i);
    }
    return result;
}

private static final long ln(int x) {
    return (long) Math.log(x);
}

And running it with:

 java -XX:+UnlockDiagnosticVMOptions \
      -XX:+PrintInlining \
      -XX:+PrintIntrinsics \
      -XX:CICompilerCount=2 \
      -XX:+PrintCompilation \
      package/Classname

It will generate a lot of lines, but one of them is:

 @ 2   java.lang.Math::log (5 bytes)   intrinsic

making this code extremely fast.

I don't really know when and how that happens in Rust though...

Eugene
  • 117,005
  • 15
  • 201
  • 306
  • 12
    Since Rust is statically (or AOT, if you'd like) compiled, it has to know a single platform to compile to. By default, it will be kind of conservative (32-bit x86 code might target the 686 processor, for example). The `-Ctarget-cpu=native` flag tells the compiler to target the machine that the compiler is running on; this allows the compiler to use the full set of available instructions (like your `popcnt` example). – Shepmaster Jul 12 '17 at 15:46
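To see what the comment above describes, the following sketch compares the features the binary was compiled for with the features the running CPU actually offers. It is only an illustration: `bits_set` and `cpu_has_popcnt` are hypothetical helper names, and the remark about `count_ones` reflects typical compiler behavior rather than a guarantee.

```rust
/// Count the set bits in `x`. With the `popcnt` target feature enabled,
/// the compiler can typically lower this to one instruction; otherwise
/// it emits a portable software sequence.
fn bits_set(x: u64) -> u32 {
    x.count_ones()
}

/// Runtime check: does the CPU we are running on support `popcnt`?
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn cpu_has_popcnt() -> Option<bool> {
    Some(std::is_x86_feature_detected!("popcnt"))
}

/// On other architectures the question does not apply.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn cpu_has_popcnt() -> Option<bool> {
    None
}

fn main() {
    // `cfg!` is resolved at compile time, so this reflects the target the
    // binary was built for, not the machine it happens to run on.
    let compiled_with_popcnt = cfg!(target_feature = "popcnt");

    println!("compiled with popcnt enabled: {}", compiled_with_popcnt);
    println!("running CPU supports popcnt : {:?}", cpu_has_popcnt());
    println!("bits set in 0b1011          : {}", bits_set(0b1011));
}
```

Building once normally and once with `RUSTFLAGS='-Ctarget-cpu=native'` should change the first line on a machine whose CPU supports `popcnt`, while the second line stays the same.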