I recently read this answer discussing pipelining. The question asked why a loop that sums two lists into two separate variables is faster than one that xors both lists into a single variable. The linked answer concluded that the two sums could run in parallel, while each xor had to wait for the previous one to finish, producing the observed difference.
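For concreteness, here is my rough reconstruction of the two loops being compared (the actual code in the linked question may differ):

```python
def sum_two_vars(a, b):
    # Two independent accumulators: neither addition
    # depends on the other's result.
    s1 = 0
    s2 = 0
    for x, y in zip(a, b):
        s1 += x
        s2 += y
    return s1 + s2

def xor_one_var(a, b):
    # One accumulator: every xor depends on the result of
    # the previous xor, forming a serial dependency chain.
    acc = 0
    for x, y in zip(a, b):
        acc ^= x
        acc ^= y
    return acc
```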
I do not understand. Doesn't efficient parallelization require multiple threads? How can these additions be run in parallel on only one thread?
Additionally, if the compiler is smart enough to magick in a whole new thread, why can't it just create two variables in the second function, execute the xors in parallel, and then xor the two variables back together after the loop terminates? To any human, such an optimization would be obvious. Is programming such an optimization into a compiler harder than I realize?
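The transformation I have in mind would look something like this (a sketch, assuming the single-accumulator xor loop above; xor is associative and commutative, so splitting the accumulator doesn't change the result):

```python
def xor_one_var(a, b):
    # Original form: one serial dependency chain.
    acc = 0
    for x, y in zip(a, b):
        acc ^= x
        acc ^= y
    return acc

def xor_two_vars(a, b):
    # Proposed form: two independent accumulators that can be
    # updated without waiting on each other, combined once at the end.
    acc1 = 0
    acc2 = 0
    for x, y in zip(a, b):
        acc1 ^= x
        acc2 ^= y
    return acc1 ^ acc2
```

Both functions return the same value on any input, so the rewrite seems safe for the compiler to perform.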
Any explanation would be greatly appreciated!