I am getting some strange results while testing OpenMP. As a test case, I sum two vectors of floats, a problem which should be perfectly parallelizable.
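For reference, here is a minimal sketch of the kind of test I mean. It is not my exact code: the function names `add_vectors` and `time_add` are made up for illustration, and it assumes the output vector is already at least as large as the inputs and that the program is compiled with OpenMP enabled (e.g. `-fopenmp` with GCC or Clang; without it the pragma is ignored and the loop runs serially).

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Element-wise sum c[i] = a[i] + b[i], split across `threads` OpenMP threads.
// Assumes b.size() >= a.size() and c.size() >= a.size().
void add_vectors(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& c, int threads) {
    // Signed loop index keeps the loop valid for older OpenMP versions too.
    const long long n = static_cast<long long>(a.size());
    #pragma omp parallel for num_threads(threads) schedule(static)
    for (long long i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Wall-clock seconds for one vector sum of length n using `threads` threads.
double time_add(std::size_t n, int threads) {
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    add_vectors(a, b, c, threads);  // warm-up run: touches memory, starts the thread pool
    const auto t0 = std::chrono::steady_clock::now();
    add_vectors(a, b, c, threads);  // timed run
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling `time_add(n, t)` for `t` = 1, 2, 4, 8 with a large `n` and comparing the times gives the speedups I am talking about.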
For sufficiently large vectors on my quad-core CPU with Hyper-Threading, which essentially means that I should have 4x2 = 8 hardware threads, I get an almost perfect speedup of a factor of two going from single-threaded to dual-threaded execution. The same holds when I go from 4 threads to 8 threads: a relative speedup of a factor of 2.
However, I get almost no speedup going from 2 to 4 threads. I could understand it if this happened in the transition from 4 to 8 threads, perhaps because Hyper-Threading, which runs two logical threads on one physical core, is imperfect. But at this intermediate stage it seems strange to me.
I would be grateful for any ideas!