I'm writing some performance-critical code in Swift. After implementing all the optimizations I could think of, and profiling the application in Instruments, I came to realize that the vast majority of CPU cycles are spent performing `map()` and `reduce()` operations on arrays of Floats. So, just to see what would happen, I replaced all instances of `map` and `reduce` with good old-fashioned `for` loops. And to my amazement... the `for` loops were much, much faster!

A bit puzzled by this, I decided to perform some rough benchmarks. In one test, I had `map` return an array of Floats after performing some simple arithmetic, like so:
```swift
// Populate array with 1,000,000,000 random numbers
var array = [Float](count: 1_000_000_000, repeatedValue: 0)
for i in 0..<array.count {
    array[i] = Float(random())
}

let start = NSDate()

// Construct a new array, with each element from the original multiplied by 5
let output = array.map({ (element) -> Float in
    return element * 5
})

// Log the elapsed time
let elapsed = NSDate().timeIntervalSinceDate(start)
print(elapsed)
```
And the equivalent `for` loop implementation:

```swift
var output = [Float]()
for element in array {
    output.append(element * 5)
}
```
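One thing worth ruling out in that comparison: each `append` call can force the array's storage to grow and reallocate, so the loop version pays a cost that `map` (which can size its result up front) does not. A variant that pre-reserves capacity might be a fairer comparison; this is just a sketch (with a small stand-in array), and I haven't re-run the numbers with it:

```swift
// Stand-in for the large random array (assumption, for illustration only)
let array: [Float] = [1, 2, 3]

// Reserve the final size up front so append never has to reallocate
var output = [Float]()
output.reserveCapacity(array.count)
for element in array {
    output.append(element * 5)
}
```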
Average execution time for `map`: 20.1 seconds. Average execution time for the `for` loop: 11.2 seconds. Results were similar using Integers instead of Floats.

I created a similar benchmark to test the performance of Swift's `reduce`. This time, `reduce` and `for` loops achieved nearly the same performance when summing the elements of one large array. But when I loop the test 100,000 times like this:
```swift
// Populate array with 1,000,000 random numbers
var array = [Float](count: 1_000_000, repeatedValue: 0)
for i in 0..<array.count {
    array[i] = Float(random())
}

let start = NSDate()

// Perform operation 100,000 times
for _ in 0..<100_000 {
    let sum = array.reduce(0, combine: {$0 + $1})
}

// Log the elapsed time
let elapsed = NSDate().timeIntervalSinceDate(start)
print(elapsed)
```
vs:
```swift
for _ in 0..<100_000 {
    var sum: Float = 0
    for element in array {
        sum += element
    }
}
```
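My suspicion about the 0.000003-second figure: in the `for` version, `sum` is never read after the loop, so the optimizer is free to delete the entire computation as dead code, whereas the call to `reduce` may not be eliminated as easily. Accumulating into a value that is actually used afterwards should keep both versions honest. A sketch of what I mean (with a small stand-in array; I haven't verified how much this changes the numbers):

```swift
// Stand-in for the 1,000,000-element benchmark array (assumption, for illustration)
let array: [Float] = [1, 2, 3]

var total: Float = 0
for _ in 0..<100_000 {
    var sum: Float = 0
    for element in array {
        sum += element
    }
    // Use each iteration's result so the optimizer can't delete the loop
    total += sum
}

// Reading total here forces the additions to actually happen
print(total)
```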
The `reduce` method takes 29 seconds while the `for` loop takes (apparently) 0.000003 seconds.

Naturally I'm ready to disregard that last test as the result of a compiler optimization, but I think it may give some insight into how the compiler optimizes differently for `for` loops vs Swift's built-in array methods. Note that all tests were performed with -Os optimization on a 2.5 GHz i7 MacBook Pro. Results varied depending on array size and number of iterations, but `for` loops always outperformed the other methods by at least 1.5x, sometimes up to 10x.
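One more variable I'd like to isolate: the closure literal `{$0 + $1}` versus the bare operator. Since `+` is itself a function in Swift, it can be passed to `reduce` directly, giving the compiler one less closure to (possibly fail to) inline. Whether this actually benchmarks differently is an assumption I haven't verified; sketch below uses a small stand-in array and the same `combine:` label as the code above:

```swift
// Stand-in for the benchmark array (assumption, for illustration only)
let array: [Float] = [1, 2, 3]

// Pass the + operator function itself instead of a closure literal
let sum = array.reduce(0, combine: +)
print(sum)
```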
I'm a bit perplexed about Swift's performance here. Shouldn't the built-in Array methods be faster than the naive approach for performing such operations? Maybe somebody with more low-level knowledge than I can shed some light on the situation.