I'm in need of high-performance merging and came accross: Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture by Jatin Chhugani et al.
Their aim is to get the most performance out of 1 CPU, one part of their solution is to use a bitonic sorting network on SIMD level. To hide latency of the min/max and shuffle operations they perform 4 sorting networks simultaneously (though I think they meant interleaved.). This gives up to a claimed 3.25x increase of performance.
My problem is somewhat relaxed, I have multiple pairs of arrays which need to be processed (read independent) so I can simply run multiple processes and thus easily gain higher throughput.
Though if I oversubscribe the amount of processes to available cores, does this hide latency as well? but induced on a higher level? Or are we treading here in the realm of hyperthreading and I'll never pass the limit of 2 processes sharing the same functional units in a CPU-core?
I could of course try, but changing the existing code is rather involved and I'd like to hear theories first.