I want to sort an array in shared memory in parallel without exiting the kernel.
I can sort an array in global memory using Thrust for CUDA, but that can only be done from the host, so I would have to exit the kernel. That would mean losing all the local memory in my threads: when I launch another kernel, I would have to refill it.
Are there any libraries that do this? Or is there any way to pause the kernel, return to the host, use Thrust to sort the array on the device, and then resume the kernel?
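
To make the goal concrete, here is a rough sketch of the shape I am after: one kernel that fills a shared array, sorts it in place, and still has its per-thread local data afterwards. The hand-written bitonic sort is only my assumption of a stand-in for whatever a library would provide. The names `N`, `bitonicSortShared`, `sortWithoutLeavingKernel`, and `runDemo` are hypothetical, and the sketch assumes a single block with one thread per element and a power-of-two array size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 256;  // hypothetical size: power of two, one element per thread

// Stand-in block-level sort (bitonic network) over a shared-memory array.
// Every thread of the N-thread block must call it, since it uses __syncthreads().
__device__ void bitonicSortShared(int* s)
{
    const unsigned tid = threadIdx.x;
    for (unsigned k = 2; k <= N; k <<= 1) {           // size of the bitonic sequences
        for (unsigned j = k >> 1; j > 0; j >>= 1) {   // compare distance
            const unsigned partner = tid ^ j;
            if (partner > tid) {                      // lower thread of each pair does the work
                const bool ascending = (tid & k) == 0;
                const int a = s[tid];
                const int b = s[partner];
                if ((a > b) == ascending) {           // out of order for this direction
                    s[tid] = b;
                    s[partner] = a;
                }
            }
            __syncthreads();                          // finish this step before the next
        }
    }
}

__global__ void sortWithoutLeavingKernel(const int* in, int* out)
{
    __shared__ int s[N];
    const unsigned tid = threadIdx.x;

    int localState = in[tid] * 2;  // stands in for per-thread local data I want to keep
    s[tid] = in[tid];
    __syncthreads();

    bitonicSortShared(s);          // the sort happens here, inside the same kernel

    out[tid] = s[tid];             // sorted shared array
    out[N + tid] = localState;     // local data is still available after the sort
}

// Host-side check: returns true if the array came back sorted and the
// per-thread local values survived the in-kernel sort.
bool runDemo()
{
    int hIn[N], hOut[2 * N];
    for (int i = 0; i < N; ++i) hIn[i] = N - i;  // descending input

    int *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, sizeof(hIn)) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, sizeof(hOut)) != cudaSuccess) { cudaFree(dIn); return false; }

    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);
    sortWithoutLeavingKernel<<<1, N>>>(dIn, dOut);
    const cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < N; ++i) {
        if (i > 0 && hOut[i - 1] > hOut[i]) return false;  // must be ascending
        if (hOut[N + i] != 2 * hIn[i]) return false;       // local data preserved
    }
    return true;
}

int main()
{
    std::printf("%s\n", runDemo() ? "sorted, local state kept" : "check failed");
    return 0;
}
```

What I would like to know is whether a library can replace the hand-written `bitonicSortShared` step, so that I do not have to maintain my own sorting network.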