You shouldn’t really benchmark in playgrounds, since playground code is unoptimized. Unless you’re specifically interested in how long things take while debugging, you should only ever benchmark optimized builds (swiftc -O).
To understand why a range-based loop can be faster, you can look at the assembly generated for the two options:
Range-based
% echo "for i in 0..<4_000 { println(i) }" | swiftc -O -emit-assembly -
; snip opening boilerplate...
LBB0_1:
movq %rbx, -32(%rbp)
; increment i
incq %rbx
movq %r14, %rdi
movq %r15, %rsi
; print (pre-incremented) i
callq __TFSs7printlnU__FQ_T_
; compare i to 4_000
cmpq $4000, %rbx
; loop if not equal
jne LBB0_1
xorl %eax, %eax
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.cfi_endproc
C-style for loop
% echo "for var i = 0;i < 4_000;++i { println(i) }" | swiftc -O -emit-assembly -
; snip opening boilerplate...
LBB0_1:
movq %rbx, -32(%rbp)
movq %r14, %rdi
movq %r15, %rsi
; print i
callq __TFSs7printlnU__FQ_T_
; increment i
incq %rbx
; jump if overflow
jo LBB0_4
; compare i to 4_000
cmpq $4000, %rbx
; loop if less than
jl LBB0_1
xorl %eax, %eax
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
LBB0_4:
; raise illegal instruction due to overflow
ud2
.cfi_endproc
So the reason the C-style loop is slower is that it performs an extra operation: checking for overflow. Either Range was written to avoid the overflow check (or to hoist it up front), or the optimizer was better able to eliminate it in the Range-based version.
If you switch to the check-free addition operator &+, you can eliminate this check. This produces near-identical code to the range-based version (the only difference being some immaterial ordering of the code):
% echo "for var i = 0;i < 4_000;i = i &+ 1 { println(i) }" | swiftc -O -emit-assembly -
; snip
LBB0_1:
movq %rbx, -32(%rbp)
movq %r14, %rdi
movq %r15, %rsi
callq __TFSs7printlnU__FQ_T_
incq %rbx
cmpq $4000, %rbx
jne LBB0_1
xorl %eax, %eax
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.cfi_endproc
Never Benchmark Unoptimized Builds
If you want to understand why, try looking at the output for the Range-based version of the above, but with no optimization:
% echo "for i in 0..<4_000 { println(i) }" | swiftc -Onone -emit-assembly -
You will see it output a lot more code. That’s because Range used via for…in is an abstraction: a struct used with custom operators and functions returning generators, and it performs a lot of safety checks and other helpful things. This makes code a lot easier to write and read. But when you turn on the optimizer, all of this disappears and you’re left with very efficient code.
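To give a rough idea of what that abstraction involves, the for…in loop is essentially shorthand for asking the range for a generator and pulling values from it until it runs dry. This is only a sketch in the same Swift-of-the-time syntax used above; the exact standard-library types and checks are more involved:

```swift
// Roughly what `for i in 0..<4_000 { println(i) }` desugars to
// (a sketch: the real generator type and safety checks differ).
var gen = (0..<4_000).generate()  // Range hands back a generator struct
while let i = gen.next() {        // next() returns nil once exhausted
    println(i)
}
```

Unoptimized, every generate() and next() call is a real function call on a struct; with -O they are all inlined away into the tight counting loop shown in the assembly above.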
Benchmarking
As for ways to benchmark: this is the code I tend to use, just replacing the contents of the runs array:
import CoreFoundation.CFDate
func timeRun<T>(name: String, f: () -> T) -> String {
    let start = CFAbsoluteTimeGetCurrent()
    let result = f()
    let end = CFAbsoluteTimeGetCurrent()
    let timeStr = toString(Int((end - start) * 1_000_000))
    return "\(name)\t\(timeStr)µs, produced \(result)"
}
let n = 4_000
let runs: [(String, () -> Void)] = [
    ("for in range", {
        for i in 0..<n { println(i) }
    }),
    ("plain ol for", {
        for var i = 0;i < n;++i { println(i) }
    }),
    ("w/o overflow", {
        for var i = 0;i < n;i = i &+ 1 { println(i) }
    }),
]
println("\n".join(map(runs, timeRun)))
But the results will probably be meaningless, since the jitter during the println calls will likely swamp the actual measurement. To really benchmark (assuming you don’t just trust the assembly analysis :) you’d need to replace the loop body with something much more lightweight.
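For example (a sketch only, reusing the n and timeRun from above), you could replace the println with cheap arithmetic, and return the accumulated value so the optimizer can’t delete the now side-effect-free loop as dead code:

```swift
// Hypothetical lighter-weight bodies: accumulate a sum instead of printing.
// Returning the sum (which timeRun then reports via "produced ...") keeps
// the optimizer from eliminating the loops entirely.
let quietRuns: [(String, () -> Int)] = [
    ("for in range", {
        var sum = 0
        for i in 0..<n { sum = sum &+ i }
        return sum
    }),
    ("plain ol for", {
        var sum = 0
        for var i = 0;i < n;++i { sum = sum &+ i }
        return sum
    }),
]
println("\n".join(map(quietRuns, timeRun)))
```

Even then, be wary: if the optimizer can compute the sum at compile time, you’re back to measuring nothing, so check the assembly for the version you actually time.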