I'm seeing a strange phenomenon. I'm doing a particle simulation that performs n*n computations for n particles:
private func computeVelocitiesSingle() async {
    // over all particles
    (0 ..< self.particles.count).forEach { i in
        let p = self.particles[i]
        // over all other particles
        self.particles.forEach { q in
            // doing some computation, updating the 'outer' particle
            p.vx += f * dx
            p.vy += f * dy
        }
    }
}
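For context, the particles are reference types with mutable velocity components, roughly like this (a simplified sketch, not the full type):

// Sketch of the particle type used above: a class (reference type),
// so `p.vx += ...` mutates the stored instance directly.
final class Particle {
    var x = 0.0
    var y = 0.0
    var vx = 0.0
    var vy = 0.0
}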
This code runs on a single core for n = 400 in 38 ms:
times: 38764 73 38 0.038759083 seconds
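The times: line comes from a small timing helper built on ContinuousClock, along these lines (a simplified sketch, not the exact helper used here):

// Measures an async block and prints the elapsed time in µs plus the Duration.
func timed(_ body: () async -> Void) async {
    let clock = ContinuousClock()
    let elapsed = await clock.measure { await body() }
    let c = elapsed.components
    let micros = c.seconds * 1_000_000 + c.attoseconds / 1_000_000_000_000
    print("times: \(micros) \(elapsed)")
}
// usage: await timed { await self.computeVelocitiesSingle() }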
The first number is the total time in µs, the other numbers are consecutive computation steps, also in µs. Having more than one core (10 on an M1 Max), I'd like to split the computation across parallel tasks, each computing a slice of the particles (see the results after the code):
private func computeVelocities() async {
    await withTaskGroup(of: Void.self) { group in
        self.taskSlices.forEach { (i0, i1) in
            group.addTask {
                let clock = ContinuousClock()
                let elapsed = await clock.measure {
                    // over the particles of this slice
                    (i0 ..< i1).forEach { i in
                        let p = self.particles[i]
                        // all other particles
                        self.particles.forEach { q in
                            // doing some computation,
                            // updating the 'outer' particle
                            p.vx += f * dx
                            p.vy += f * dy
                        }
                    }
                }
                print("\(i0) ..< \(i1) -> \(elapsed)")
            }
        }
    }
}
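self.taskSlices just splits the particle indices into one contiguous range per core, along these lines (a sketch; makeTaskSlices and its parameters are illustrative names, but the boundaries match the output below):

// Split 0 ..< count into `coreCount` contiguous (start, end) pairs,
// e.g. count = 400, coreCount = 10 -> (0, 40), (40, 80), ..., (360, 400).
func makeTaskSlices(count: Int, coreCount: Int) -> [(Int, Int)] {
    let chunk = (count + coreCount - 1) / coreCount   // ceiling division
    return stride(from: 0, to: count, by: chunk).map { start in
        (start, min(start + chunk, count))
    }
}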
Here are the results using different numbers of cores, measuring the execution time of each slice and the total time (as before) on the last line:
1 core
0 ..< 400 -> 0.032809417 seconds
times: 32856 66 30 0.032855083 seconds
2 cores
200 ..< 400 -> 0.046012959 seconds
0 ..< 200 -> 0.0460945 seconds
times: 46119 69 30 0.046118042 seconds
4 cores
100 ..< 200 -> 0.058119167 seconds
300 ..< 400 -> 0.058544708 seconds
0 ..< 100 -> 0.066003292 seconds
200 ..< 300 -> 0.0668385 seconds
times: 66924 77 31 0.066921667 seconds
8 cores
50 ..< 100 -> 0.123520042 seconds
100 ..< 150 -> 0.141902875 seconds
350 ..< 400 -> 0.144477834 seconds
150 ..< 200 -> 0.144727458 seconds
200 ..< 250 -> 0.145149042 seconds
0 ..< 50 -> 0.146575167 seconds
250 ..< 300 -> 0.146813583 seconds
300 ..< 350 -> 0.14681375 seconds
times: 146921 67 30 0.146919833 seconds
10 cores
240 ..< 280 -> 0.116961667 seconds
280 ..< 320 -> 0.117469083 seconds
80 ..< 120 -> 0.119686875 seconds
120 ..< 160 -> 0.122491 seconds
360 ..< 400 -> 0.117837709 seconds
160 ..< 200 -> 0.125832917 seconds
200 ..< 240 -> 0.128698208 seconds
40 ..< 80 -> 0.129306125 seconds
0 ..< 40 -> 0.130126042 seconds
320 ..< 360 -> 0.128898833 seconds
times: 130183 69 30 0.130182 seconds
Added: In the comments below, synchronization overhead was mentioned, so I increased the particle count to n = 4000. Here are the results.
Regular thread: times: 3647994 1779 286 3.647961084 seconds
1 core:
0 ..< 4000 -> 3.267147417 seconds
times: 3267229 1763 262 3.267201417 seconds
2 cores:
2000 ..< 4000 -> 4.005799125 seconds
0 ..< 2000 -> 4.010347584 seconds
times: 4010190 1933 279 4.010372625 seconds
8 cores:
500 ..< 1000 -> 14.019210832999999 seconds
1500 ..< 2000 -> 14.035509583000001 seconds
3500 ..< 4000 -> 14.059028875000001 seconds
2000 ..< 2500 -> 14.064966416999999 seconds
2500 ..< 3000 -> 14.072029209 seconds
3000 ..< 3500 -> 14.073866667 seconds
1000 ..< 1500 -> 14.177985875000001 seconds
0 ..< 500 -> 14.180926750000001 seconds
times: 14181089 1696 270 14.180976042 seconds
Each multi-core run takes as long as its slowest parallel task. I've checked Activity Monitor and the process runs at about 900% CPU, so all cores are actively computing. Any ideas?