I recently read this answer discussing pipelining. The question asked why a loop that sums two lists into two separate variables is faster than one that xors both lists into a single variable. The linked answer concluded that the two sums could run in parallel, while each xor had to wait for the previous one to finish, producing the observed difference.
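For concreteness, here is my rough reconstruction of the two loops being compared (the actual code in the linked question may differ):

```python
def sum_two_vars(a, b):
    # Two independent accumulators: neither addition
    # depends on the other's result.
    s1 = 0
    s2 = 0
    for x, y in zip(a, b):
        s1 += x
        s2 += y
    return s1 + s2

def xor_one_var(a, b):
    # One accumulator: every xor depends on the result of
    # the previous xor, forming a serial dependency chain.
    acc = 0
    for x, y in zip(a, b):
        acc ^= x
        acc ^= y
    return acc
```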
I do not understand. Doesn't efficient parallelization require multiple threads? How can these additions be run in parallel on only one thread?
Additionally, if the compiler is smart enough to magick in a whole new thread, why can't it just create two variables in the second function, execute the xors in parallel, and then xor the two variables back together after the loop terminates? To any human, such an optimization would be obvious. Is programming such an optimization into a compiler harder than I realize?
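The transformation I have in mind would look something like this (a sketch, assuming the single-accumulator xor loop above; xor is associative and commutative, so splitting the accumulator doesn't change the result):

```python
def xor_one_var(a, b):
    # Original form: one serial dependency chain.
    acc = 0
    for x, y in zip(a, b):
        acc ^= x
        acc ^= y
    return acc

def xor_two_vars(a, b):
    # Proposed form: two independent accumulators that can be
    # updated without waiting on each other, combined once at the end.
    acc1 = 0
    acc2 = 0
    for x, y in zip(a, b):
        acc1 ^= x
        acc2 ^= y
    return acc1 ^ acc2
```

Both functions return the same value on any input, so the rewrite seems safe for the compiler to perform.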
Any explanation would be greatly appreciated!