
I read that range-based loops have better performance in some programming languages. Is that the case in Swift? For instance, in a Playground:

func timeDebug(desc: String, function: ()->() )
{
    let start : UInt64 = mach_absolute_time()
    function()
    let duration : UInt64 = mach_absolute_time() - start

    var info : mach_timebase_info = mach_timebase_info(numer: 0, denom: 0)
    mach_timebase_info(&info)

    let total = (duration * UInt64(info.numer) / UInt64(info.denom)) / 1_000
    println("\(desc): \(total) µs.")
}

func loopOne(){
    for i in 0..<4000 {
        println(i);
    }
}

func loopTwo(){
    for var i = 0; i < 4000; i++ {
        println(i);
    }
}

Range-based loop:

timeDebug("Loop One time"){
    loopOne(); // Loop One time: 2075159 µs.
}

Normal (C-style) for loop:

timeDebug("Loop Two time"){
    loopTwo(); // Loop Two time: 1905956 µs.
}

How do I properly benchmark in Swift?

// Update on the device

First run

Loop Two time: 54 µs.
Loop One time: 482 µs.

Second run

Loop Two time: 44 µs.
Loop One time: 382 µs.

Third run

Loop Two time: 43 µs.
Loop One time: 419 µs.

Fourth run

Loop Two time: 44 µs.
Loop One time: 399 µs.

// Update 2

    func printTimeElapsedWhenRunningCode(title: String, operation: () -> ()) {
        let startTime = CFAbsoluteTimeGetCurrent()
        operation()
        let timeElapsed = CFAbsoluteTimeGetCurrent() - startTime
        println("Time elapsed for \(title): \(timeElapsed) s")
    }


    printTimeElapsedWhenRunningCode("Loop Two time") {
        loopTwo(); // Time elapsed for Loop Two time: 4.10079956054688e-05 s
    }

    printTimeElapsedWhenRunningCode("Loop One time") {
        loopOne(); // Time elapsed for Loop One time: 0.000500023365020752 s.
    }
grape1
  • Oh yes, but `loopOne` is still slower. – grape1 May 16 '15 at 10:01
  • You have observed that the "normal" for loop is faster than a range-based for loop (which I can confirm). But what is your question? – Martin R May 16 '15 at 10:06
  • In the Playground the range-based for loop is slower, but on the device it is faster. Check the update. – grape1 May 16 '15 at 10:09
  • Sorry, I don't get it. All your numbers show that "loop one" (which is the range-based loop) is slower. In any case: what is your question? Are you asking *why* one is faster or slower? Or are you assuming that your benchmarking method is wrong? – Martin R May 16 '15 at 10:14
  • Yes, I think the benchmarking method is wrong, because the original code divides by 1_000_000, but that returns 0 for both loops. – grape1 May 16 '15 at 10:16
  • Integer division *truncates* the result to an integer, so if the time is less than one millisecond (= 1,000,000 nanoseconds) then division by 1_000_000 will give zero. Change your code to `let totalMilliSeconds = (Double(duration) * Double(info.numer) / Double(info.denom)) / 1_000_000` to avoid that problem. – Martin R May 16 '15 at 10:21
  • Using your code returns Loop Two time: 0.0589583333333333 µs. Loop One time: 0.354875 µs. But using a different benchmark method, the result is different. – grape1 May 16 '15 at 10:28
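Putting Martin R's comment into practice, a corrected version of the question's `timeDebug` helper might do the timebase conversion in `Double` before scaling, so sub-millisecond durations are no longer truncated to zero by integer division (a sketch in the same Swift 1.2-era syntax as the question; `mach_absolute_time`/`mach_timebase_info` are as in the original):

```swift
import Darwin  // mach_absolute_time, mach_timebase_info

func timeDebugFixed(desc: String, function: () -> ()) {
    let start = mach_absolute_time()
    function()
    let duration = mach_absolute_time() - start

    var info = mach_timebase_info(numer: 0, denom: 0)
    mach_timebase_info(&info)

    // Convert ticks to nanoseconds in floating point *before* scaling,
    // so durations under the divisor no longer truncate to 0.
    let nanos = Double(duration) * Double(info.numer) / Double(info.denom)
    println("\(desc): \(nanos / 1_000) µs.")
}
```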

1 Answer


You shouldn’t really benchmark in playgrounds since they’re unoptimized. Unless you’re interested in how long things will take when you’re debugging, you should only ever benchmark optimized builds (swiftc -O).

To understand why a range-based loop can be faster, you can look at the assembly generated for the two options:

Range-based

% echo "for i in 0..<4_000 { println(i) }" | swiftc -O -emit-assembly -
; snip opening boiler plate...
LBB0_1:
    movq    %rbx, -32(%rbp)
; increment i
    incq    %rbx
    movq    %r14, %rdi
    movq    %r15, %rsi
; print (pre-incremented) i
    callq   __TFSs7printlnU__FQ_T_
; compare i to 4_000
    cmpq    $4000, %rbx
; loop if not equal
    jne LBB0_1
    xorl    %eax, %eax
    addq    $8, %rsp
    popq    %rbx
    popq    %r14
    popq    %r15
    popq    %rbp
    retq
    .cfi_endproc

C-style for loop

% echo "for var i = 0;i < 4_000;++i { println(i) }" | swiftc -O -emit-assembly -
; snip opening boiler plate...
LBB0_1:
    movq    %rbx, -32(%rbp)
    movq    %r14, %rdi
    movq    %r15, %rsi
; print i
    callq   __TFSs7printlnU__FQ_T_
; increment i
    incq    %rbx
; jump if overflow
    jo  LBB0_4
; compare i to 4_000
    cmpq    $4000, %rbx
; loop if less than
    jl  LBB0_1
    xorl    %eax, %eax
    addq    $8, %rsp
    popq    %rbx
    popq    %r14
    popq    %r15
    popq    %rbp
    retq
LBB0_4:
; raise illegal instruction due to overflow
    ud2
    .cfi_endproc

So the reason the C-style loop is slower is that it performs an extra operation – checking for overflow. Either `Range` was written to avoid the overflow check (or to do it up front), or the optimizer was better able to eliminate it in the `Range` version.
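You can see the difference between the two addition operators directly: plain `+` on `Int` traps on overflow (that is the `jo LBB0_4` check in the assembly above), while `&+` silently wraps around, so no check needs to be emitted. A quick sketch:

```swift
// Ordinary + traps (crashes) at runtime if the result overflows Int;
// with constant operands it is even rejected at compile time.
// &+ performs two's-complement wrapping addition instead.
let wrapped = Int.max &+ 1
assert(wrapped == Int.min)  // wraps from the maximum to the minimum value
```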

If you switch to the overflow-discarding addition operator, `&+`, you can eliminate this check. This produces near-identical code to the range-based version (the only difference being some immaterial ordering of the code):

% echo "for var i = 0;i < 4_000;i = i &+ 1 { println(i) }" | swiftc -O -emit-assembly -
; snip
LBB0_1:
    movq    %rbx, -32(%rbp)
    movq    %r14, %rdi
    movq    %r15, %rsi
    callq   __TFSs7printlnU__FQ_T_
    incq    %rbx
    cmpq    $4000, %rbx
    jne LBB0_1
    xorl    %eax, %eax
    addq    $8, %rsp
    popq    %rbx
    popq    %r14
    popq    %r15
    popq    %rbp
    retq
    .cfi_endproc

Never Benchmark Unoptimized Builds

If you want to understand why, try looking at the output for the range-based version of the above, but with no optimization: echo "for i in 0..<4_000 { println(i) }" | swiftc -Onone -emit-assembly -. You will see it output a lot more code. That’s because Range used via for…in is an abstraction – a struct used with custom operators and functions returning generators – and it performs a lot of safety checks and other helpful things. This makes code a lot easier to write/read. But when you turn on the optimizer, all this disappears and you’re left with very efficient code.

Benchmarking

As for how to benchmark, this is the code I tend to use, just swapping out the closures in the `runs` array:

import CoreFoundation.CFDate

func timeRun<T>(name: String, f: ()->T) -> String {
    let start = CFAbsoluteTimeGetCurrent()
    let result = f()
    let end = CFAbsoluteTimeGetCurrent()
    let timeStr = toString(Int((end - start) * 1_000_000))
    return "\(name)\t\(timeStr)µs, produced \(result)"
}

let n = 4_000

let runs: [(String,()->Void)] = [
    ("for in range", {
        for i in 0..<n { println(i) }    
    }),
    ("plain ol for", {
        for var i = 0;i < n;++i { println(i) }    
    }),
    ("w/o overflow", {
        for var i = 0;i < n;i = i &+ 1 { println(i) }    
    }),
]

println("\n".join(map(runs, timeRun)))

But the results will probably be meaningless, since jitter during `println` will likely obscure the actual measurement. To really benchmark (assuming you don’t just trust the assembly analysis :) you’d need to replace it with something very lightweight.
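One way to keep the loop body lightweight while still stopping the optimizer from deleting the loop entirely is to accumulate into a value that the closure returns, so `timeRun` prints it as part of the result line. A sketch reusing the `timeRun` helper above, in the same Swift 1.2-era syntax (the `&+` accumulation is just an arbitrary cheap operation, not the author's benchmark):

```swift
let m = 4_000

let quietRuns: [(String, () -> Int)] = [
    ("for in range", {
        var sum = 0
        for i in 0..<m { sum = sum &+ i }  // cheap body instead of println
        return sum  // returning the sum dissuades dead-code elimination
    }),
    ("plain ol for", {
        var sum = 0
        for var i = 0; i < m; ++i { sum = sum &+ i }
        return sum
    }),
]

println("\n".join(map(quietRuns, timeRun)))
```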

Airspeed Velocity
  • That's interesting. In your tests, a range-based loop is faster, while OP claims that a C-style loop is faster. I actually get very different results, depending on the number of iterations, and also depending on which one (loopOne or loopTwo) is measured first. – Martin R May 16 '15 at 12:07
  • For small `n`, LLVM will unroll some loops as well. I deleted the benchmark since multiple tests showed it was very variable – mainly because the `println` dwarfs the time of the loop itself. A true benchmark would need to do something more lightweight inside the loop (but enough to dissuade the optimizer from just deleting it!). I suspect the OP was testing in a playground, which is essentially meaningless; what you’re benchmarking is really the playground magic. – Airspeed Velocity May 16 '15 at 12:13
  • Also, in unoptimized builds, the `Range` object will really exist. In optimized builds, structs often disappear and are replaced by registers. Just reinforces the point, it really doesn’t make sense to benchmark `-Onone` – Airspeed Velocity May 16 '15 at 12:16
  • Sure. I replaced `println()` by a call to a global function `foo(i)` which modifies a global variable, and tested it in a compiled project with optimization (standard "Release" configuration). I had cases where the C-style loop was faster, but I cannot reproduce it anymore :) – The overflow checking is definitely a good point! – Martin R May 16 '15 at 12:24