
Strange phenomenon: I'm doing some particle simulation, performing n*n computations for n particles:

private func computeVelocitiesSingle() async {
    // over all particles
    (0 ..< self.particles.count).forEach { i in
        let p = self.particles[i]
        // over all other particles
        self.particles.forEach { q in
            // doing some computation (f, dx, dy derived from p and q)
            p.vx += f * dx
            p.vy += f * dy
        }
    }
}
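
The run is timed with ContinuousClock around the call; a minimal sketch of such a wrapper (the helper name runSingleTimed is hypothetical, and the actual logging that produces the "times:" lines below differs in detail):

private func runSingleTimed() async {
    let clock = ContinuousClock()
    // time the complete n*n pass over all particles
    let elapsed = await clock.measure {
        await self.computeVelocitiesSingle()
    }
    print("single core: \(elapsed)")
}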

This code runs on a single core for n = 400 in 38 ms:

times: 38764 73 38  0.038759083 seconds

The first number is the time in µs; the other numbers are the times of consecutive follow-up computations, also in µs. Having more than one core (10 on an M1 Max), I'd like to split the computation into parallel tasks, where each task computes a slice of the particles (see the results after the code):

private func computeVelocities() async {
    await withTaskGroup(of: Void.self) { group in
        self.taskSlices.forEach { (i0, i1) in
            group.addTask {
                let clock = ContinuousClock()
                let elapsed = await clock.measure {
                    (i0 ..< i1).forEach { i in
                        let p = self.particles[i]
                        // over all other particles
                        self.particles.forEach { q in
                            // doing some computation,
                            // updating the 'outer' particle
                            p.vx += f * dx
                            p.vy += f * dy
                        }
                    }
                }
                print("\(i0) ..< \(i1) -> \(elapsed)")
            }
        }
    }
}
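
`taskSlices` is not shown above; it just cuts the particle indices into contiguous, equally sized chunks, one per task. A minimal sketch (the helper name `makeTaskSlices` is hypothetical; the slice count is varied between the runs below):

private func makeTaskSlices(count: Int) -> [(Int, Int)] {
    // split 0 ..< particles.count into `count` contiguous ranges;
    // the last slice takes any remainder
    let chunk = self.particles.count / count
    return (0 ..< count).map { k -> (Int, Int) in
        let i0 = k * chunk
        let i1 = (k == count - 1) ? self.particles.count : i0 + chunk
        return (i0, i1)
    }
}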

Here are the results using different numbers of cores, measuring the execution time of each slice and, on the last line, the total time (as before):

1 core
0 ..< 400 -> 0.032809417 seconds

times: 32856 66 30  0.032855083 seconds
2 cores
200 ..< 400 -> 0.046012959 seconds
0 ..< 200 -> 0.0460945 seconds

times: 46119 69 30  0.046118042 seconds
4 cores
100 ..< 200 -> 0.058119167 seconds
300 ..< 400 -> 0.058544708 seconds
0 ..< 100 -> 0.066003292 seconds
200 ..< 300 -> 0.0668385 seconds

times: 66924 77 31  0.066921667 seconds
8 cores
50 ..< 100 -> 0.123520042 seconds
100 ..< 150 -> 0.141902875 seconds
350 ..< 400 -> 0.144477834 seconds
150 ..< 200 -> 0.144727458 seconds
200 ..< 250 -> 0.145149042 seconds
0 ..< 50 -> 0.146575167 seconds
250 ..< 300 -> 0.146813583 seconds
300 ..< 350 -> 0.14681375 seconds

times: 146921 67 30  0.146919833 seconds
10 cores
240 ..< 280 -> 0.116961667 seconds
280 ..< 320 -> 0.117469083 seconds
80 ..< 120 -> 0.119686875 seconds
120 ..< 160 -> 0.122491 seconds
360 ..< 400 -> 0.117837709 seconds
160 ..< 200 -> 0.125832917 seconds
200 ..< 240 -> 0.128698208 seconds
40 ..< 80 -> 0.129306125 seconds
0 ..< 40 -> 0.130126042 seconds
320 ..< 360 -> 0.128898833 seconds

times: 130183 69 30  0.130182 seconds

Added: In the comments below, synchronization overhead was mentioned, so I increased the number of particles to n = 4000. Here are the results.

Regular thread: times: 3647994 1779 286 3.647961084 seconds

1 core:
0 ..< 4000 -> 3.267147417 seconds
times: 3267229 1763 262  3.267201417 seconds
2 cores:
2000 ..< 4000 -> 4.005799125 seconds
0 ..< 2000 -> 4.010347584 seconds
times: 4010190 1933 279  4.010372625 seconds
8 cores:
500 ..< 1000 -> 14.019210832999999 seconds
1500 ..< 2000 -> 14.035509583000001 seconds
3500 ..< 4000 -> 14.059028875000001 seconds
2000 ..< 2500 -> 14.064966416999999 seconds
2500 ..< 3000 -> 14.072029209 seconds
3000 ..< 3500 -> 14.073866667 seconds
1000 ..< 1500 -> 14.177985875000001 seconds
0 ..< 500 -> 14.180926750000001 seconds
times: 14181089 1696 270  14.180976042 seconds

Each multi-core execution takes as long as the longest-running of its parallel tasks. I've checked the Activity Monitor and the process is running at about 900% CPU, so all cores are actively doing computations. Any ideas?

osx
  • We need [MRE](https://stackoverflow.com/help/minimal-reproducible-example). – Rob Aug 13 '23 at 12:57
  • Parallel execution introduces a little overhead and parallel calculations are often slower if there is not enough work on each core. (The overhead of concurrency can easily outweigh any gains achieved through parallelism.) Especially if there is any synchronization needed or any resource contention. Frankly if single threaded calculation only takes 38 msec, it is not entirely surprising that parallel rendition takes longer. Perhaps this might be candidate for GPU approach. – Rob Aug 13 '23 at 13:12
  • The threads need 4 times longer to execute a tenth of the work, and the Activity Monitor showed a 100% workload on each core. This is quite a lot for some overhead. – osx Aug 13 '23 at 13:48
  • I hear you. But the multithreading (and synchronization) overhead actually is modest; when you're talking about a calculation that takes a few milliseconds, it will dwarf any gains from parallelism. You can try `concurrentPerform`, which has a little less overhead, IIRC (a minimal sketch follows these comments), but with so little work on each core, you are still going to see a ton of overhead (as % of the overall execution time). I generally only see significant parallelism benefits if the calculation for each loop takes much longer (e.g., seconds, not msec), e.g., striding through rows of pixels in a multi-megabyte image. – Rob Aug 13 '23 at 20:36
  • FWIW, the fact that 400 iterations single-threaded is faster than each of the 200 iterations in the two core scenario might suggest that there is some requisite synchronization that is slowing it down. There’s not enough here for us to diagnose it, though. We need a MRE to assist further. – Rob Aug 13 '23 at 22:43
  • @Rob Regarding the overhead: I increased the number of particles to n = 4000; the results are basically the same (except for the longer execution time). Interestingly, the single-core `withTaskGroup` runs faster than the 'regular' code, see my additional text in the question. I'm working on a MRE. – osx Aug 14 '23 at 04:52
  • @Rob The development of the MRE showed a different behavior. It turned out that the calls of the computing function were interleaved, i.e. more than one cycle was processed at the same time. Now the MRE runs faster with more cores than on a single thread, but the decrease in execution time is not linear with the number of cores, so there is still some overhead. The original source code still shows the described behavior, even without interleaving now. I'll try to figure out which part of the code slows down the parallel execution. – osx Aug 17 '23 at 06:29
  • Yeah, that all makes sense. Good luck. – Rob Aug 17 '23 at 12:41
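
A minimal sketch of the `concurrentPerform` variant Rob mentions above, assuming the same class context (`particles`, `taskSlices`) and the same per-pair computation as in the question (the function name is hypothetical):

private func computeVelocitiesConcurrentPerform() {
    let slices = self.taskSlices
    // one iteration per slice, executed in parallel by GCD
    DispatchQueue.concurrentPerform(iterations: slices.count) { s in
        let (i0, i1) = slices[s]
        (i0 ..< i1).forEach { i in
            let p = self.particles[i]
            // over all other particles
            self.particles.forEach { q in
                // same per-pair computation as above:
                // compute f, dx, dy from p and q, then
                // p.vx += f * dx ; p.vy += f * dy
            }
        }
    }
}

Like the task-group version, it processes one slice per iteration, but dispatches the iterations via GCD instead of Swift concurrency.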

0 Answers