Can multiple processes hide latency of SSE instructions?

Question

I'm in need of high-performance merging and came across "Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture" by Jatin Chhugani et al.

Their aim is to get the most performance out of a single CPU. One part of their solution is a bitonic sorting network at the SIMD level. To hide the latency of the min/max and shuffle operations, they run 4 sorting networks simultaneously (though I think they mean interleaved). This gives a claimed speedup of up to 3.25x.
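To illustrate what "interleaved" means here, the following is a minimal sketch, not the paper's code. It assumes a plain 4-input sorting network applied column-wise to SSE registers instead of the paper's bitonic merge network, and it interleaves 2 independent networks rather than 4 to keep it short. The names `CMPX` and `sort_two_blocks` are made up for this example. Within one network each level depends on the previous one, so alternating instructions from two independent networks gives the core something to issue while a min/max result is still in flight. An out-of-order core or the compiler may already reorder some of this on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_min_ps, _mm_max_ps */

/* Compare-exchange on two registers: afterwards x holds the lane-wise
   minimum and y the lane-wise maximum. */
#define CMPX(x, y)                      \
    do {                                \
        __m128 t_ = _mm_min_ps(x, y);   \
        y = _mm_max_ps(x, y);           \
        x = t_;                         \
    } while (0)

/* Sort every column of two independent 4x4 float blocks (row-major,
   16 floats each). The network is (0,1)(2,3) -> (0,2)(1,3) -> (1,2);
   steps for block p and block q alternate so that neither dependency
   chain has to wait idle for its own previous result. */
void sort_two_blocks(float *x, float *y)
{
    assert(x != NULL && y != NULL);

    __m128 p[4], q[4];
    for (int i = 0; i < 4; ++i) {
        p[i] = _mm_loadu_ps(x + 4 * i);
        q[i] = _mm_loadu_ps(y + 4 * i);
    }

    /* Level 1 */
    CMPX(p[0], p[1]); CMPX(q[0], q[1]);
    CMPX(p[2], p[3]); CMPX(q[2], q[3]);
    /* Level 2 */
    CMPX(p[0], p[2]); CMPX(q[0], q[2]);
    CMPX(p[1], p[3]); CMPX(q[1], q[3]);
    /* Level 3 */
    CMPX(p[1], p[2]); CMPX(q[1], q[2]);

    for (int i = 0; i < 4; ++i) {
        _mm_storeu_ps(x + 4 * i, p[i]);
        _mm_storeu_ps(y + 4 * i, q[i]);
    }
}
```

Calling `sort_two_blocks(x, y)` on two 16-float arrays leaves each column of each block in ascending order from row 0 to row 3. Whether the interleaving pays off on a given CPU would have to be measured.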

My problem is somewhat more relaxed: I have multiple pairs of arrays that need to be processed (read: independently), so I can simply run multiple processes and easily gain throughput.

But if I oversubscribe, running more processes than there are available cores, does this hide latency as well, just at a higher level? Or are we in the realm of hyperthreading here, so that I'll never get past the limit of 2 processes sharing the same functional units in a CPU core?

I could of course try, but changing the existing code is rather involved and I'd like to hear theories first.

It's not clear exactly what your question is. As for hyper-threading, it is a technique to increase the effective ILP of a single core. If you don't optimize your code well, hyper-threading may improve the ILP. But if you optimize so that you're already getting about as much ILP as possible, hyper-threading will hurt performance. In practice most code is not optimized to get the most ILP, so in many cases hyper-threading helps. – Z boson Nov 19 '14 at 11:38

score 2 · Answer 1 · answered Nov 19 '14 at 13:53

I've done some experiments with this, and the benefit of HT seems to be marginal - on the one hand you see some small improvements from hiding latency, but on the other hand you double the pressure on cache usage and FSB bandwidth (and double the memory contention too). In some cases I've seen a small gain, in others a small reduction in performance - it all depends on memory access pattern and cache footprint, but from what I've seen HT doesn't really help much overall.

Having said that, there may be cases where code that isn't particularly well optimised as far as memory access patterns are concerned gets something out of HT, but if you haven't optimised usage of the cache/memory hierarchy then SSE optimisation is probably premature anyway.

score 1 · Accepted Answer · answered Nov 19 '14 at 10:15

No, threading is not an effective solution to pipeline bubbles. The granularity doesn't fit: a context switch takes hundreds of cycles, whereas the stalls caused by a naive implementation of bitonic sorting come in pieces of 2-4 cycles.

With that said, it's not clear what your use-case is, or where the bottleneck will turn out to be, so multiprocessing could help. Only one way to find out.

Sneftel


OK, clear. Further, I admit I'm no hardware guru, but does hyperthreading allow 2 running processes to be pipelined at the needed granularity? (I know it doesn't solve my problem, because I would need hyperthreading with support for 4 hardware contexts.) – hbogert Nov 19 '14 at 10:30

It's possible, depending on how aggressive the instruction scheduler is. But if not done very carefully, any wins from reduced stalls would be eclipsed by cache thrashing. – Sneftel Nov 19 '14 at 12:38
