
I have a large amount of data which I need to sort: several million arrays, each with tens of thousands of values. What I'm wondering is the following:

Is it better to implement a parallel sorting algorithm on the GPU and run it across all the arrays,

OR

implement a single-threaded algorithm, like quicksort, and assign each GPU thread a different array?

Obviously speed is the most important factor. For a single-threaded sorting algorithm, memory is a limiting factor. I've already tried to implement a recursive quicksort, but it doesn't seem to work for large amounts of data, so I'm assuming there is a memory issue.

The data type to be sorted is long, so I don't believe a radix sort would be possible, since the binary representation of the numbers would be too long.

Any pointers would be appreciated.

Hans Rudel
  • @JackOLantern Thanks for the link. I guess I wasn't clear: both options will be run on the GPU. – Hans Rudel Jul 13 '13 at 17:14
  • 1
  • @HansRudel Are all of those arrays the same length? Even if they are the same length, if you want to use the thread-per-array approach, you should use an algorithm that always executes the same steps regardless of how (un)sorted the data is; you will want to avoid thread divergence caused either by differences in array length or by differences in how the sorting algorithm executes for different data. – RoBiK Jul 13 '13 at 18:19
  • @RoBiK Hey bro, yeah, they are of differing lengths. I've finally got a quicksort algorithm written and working on a CPU, but I haven't tried it on the GPU. Times for the CPU version are 6 seconds for 16,000 values, which isn't very good. So if I've understood your comment, I should generate all the lists first, then implement a parallel version and sort a single row at a time? – Hans Rudel Jul 13 '13 at 18:24

1 Answer


Sorting is an operation that has received a lot of attention. Writing your own sort isn't advisable if you are interested in high performance. I would consider something like thrust, back40computing, moderngpu, or CUB for sorting on the GPU.
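For instance, a minimal sketch of sorting a single array of longs on the GPU with thrust (the helper name `sort_one_array` and the host container are placeholders, not anything from your code):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <vector>

// Minimal sketch: sort one array of longs on the GPU with thrust::sort.
// sort_one_array and h_data are placeholder names.
void sort_one_array(std::vector<long>& h_data)
{
    // Copy the array to the device
    thrust::device_vector<long> d_data(h_data.begin(), h_data.end());

    // Sort on the GPU (thrust chooses the algorithm internally)
    thrust::sort(d_data.begin(), d_data.end());

    // Copy the sorted result back to the host
    thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
}
```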

Most of the above will be handling an array at a time, using the full GPU to sort an array. There are techniques within thrust to do a vectorized sort which can handle multiple arrays "at once", and CUB may also be an option for doing a "per-thread" sort (let's say, "per thread block").
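As a rough illustration of the vectorized-sort idea, one common pattern is a back-to-back stable sort: concatenate all the arrays, tag each element with its array (segment) index, stable-sort by value, then stable-sort by segment index. The names below are placeholders, and the sketch assumes the data has already been flattened that way:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Sketch of a segmented ("vectorized") sort: all arrays concatenated in
// d_data, with d_seg holding each element's array index. Placeholder names.
void sort_all_segments(thrust::device_vector<long>& d_data,
                       thrust::device_vector<int>&  d_seg)
{
    // Pass 1: stable sort by value, carrying segment indices along.
    thrust::stable_sort_by_key(d_data.begin(), d_data.end(), d_seg.begin());

    // Pass 2: stable sort by segment index, carrying values along.
    // Stability preserves the value order from pass 1 within each segment,
    // so every array ends up individually sorted after the two passes.
    thrust::stable_sort_by_key(d_seg.begin(), d_seg.end(), d_data.begin());
}
```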

Generally I would say the same thing about CPU sorting code. Don't write your own.
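On the CPU that usually just means calling the standard library; a trivial sketch (function name is a placeholder):

```cpp
#include <algorithm>
#include <vector>

// CPU sketch: std::sort (introsort) on one array of longs.
void cpu_sort(std::vector<long>& data)
{
    std::sort(data.begin(), data.end());
}
```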

EDIT: I guess one more comment. I would lean heavily towards the first approach you mention (i.e. not doing a sort per thread). There are two related reasons for this:

  1. Most of the fast sorting work has been done along the lines of your first method, not the second.
  2. The GPU is generally better at being fast when the work is well adapted to SIMD or SIMT. This means we generally want each thread to be doing the same thing, minimizing branching and warp divergence. This is harder to achieve (I think) in the second case, where each thread appears to be following the same sequence but in fact data dependencies cause "algorithm divergence". On the surface of it, you might wonder whether the same criticism could be levelled at the first approach, but since the libraries I mention are written by experts, they are aware of how best to utilize the SIMT architecture. The thrust "vectorized sort" and CUB approaches allow multiple sorts to be done per operation while still taking advantage of the SIMT architecture.
Robert Crovella
  • Thanks very much for the links. I'm not 100% sure if these are supported by CUDAFY, but I will dig around and see. It's taken several hours to get quicksort working properly, so your comment "Don't write your own." is sound advice, especially since quicksort is supposed to be an easy algorithm. Thanks again for your help, much appreciated. – Hans Rudel Jul 14 '13 at 00:48
  • 1
  • I think none of them are managed code, so using them would be more challenging (from Cudafy/C#). If you really want max sort performance, however, I think it might make sense to consider building an unmanaged DLL or something like that, i.e. stepping outside of your Cudafy sandbox. I had a project I was working on where sort perf was critical, and in the end, by incorporating the modernGPU stuff, I was able to get ~2x even over thrust performance. It might make sense to not require *everything* to be cudafy-ready. But yes, it is more work. – Robert Crovella Jul 14 '13 at 01:19
  • Would it be possible to speak to you in one of the chatrooms, please? This is the C# room, but we can make a separate one if it's busy. http://chat.stackoverflow.com/rooms/7/c – Hans Rudel Jul 15 '13 at 16:42