I have a list of randomly generated numbers, which contains 1900 numbers, and I want to obtain a sorted list of the top 190 numbers. I've written two versions of the partial sorting algorithm, 1st is a CPU version and 2nd is written so it can run on Cudafy.net. But there is a large difference in execution time between them, when run on the CPU, I was wondering if someone could shed some light on why, + is it possible to speed the 2nd version up further?
Note: the 2nd algorithm is going to be run on a GPU so I can't use linq or anything which wouldn't run on C as I will be using cudafy.net to run the code. Unfortunately cudafy.net also doesn't support jagged arrays.
Version 1:
/// <summary>
/// Sequentially runs through all the values in the array and identifies if
/// the current number is less than the highest number in the sorted list.
/// </summary>
/// <param name="numbers"> Unsorted array of numbers.</param>
/// <param name="sortedNumbers"> Array used to hold the partial list of sorted numbers.</param>
public static void NewSorter(int[] numbers, int[] sortedNumbers)
{
for (int i = 0; i < numbers.Length; i++)
{
if (sortedNumbers[sortedNumbers.Length - 1] > numbers[i])
{
//Update numbers
IdentifyPosition(sortedNumbers, numbers[i]);
}
}
}
/// <summary>
/// Identifies the position the number should be placed in the partial list of sorted numbers.
/// </summary>
/// <param name="sortedNumbers"> Array used to hold the partial list of sorted numbers.</param>
/// <param name="NewNumber"> Number to be inserted.</param>
static void IdentifyPosition(int[] sortedNumbers, int NewNumber)
{
for (int i = 0; i < sortedNumbers.Length; i++)
{
if (NewNumber < sortedNumbers[i])
{
//Offset and add.
ArrayShifter(sortedNumbers, i, NewNumber);
break;
}
}
}
/// <summary>
/// Moves all the elements to the right of a point up one and
/// then places the new number in the specified point.
/// </summary>
/// <param name="SortedNumbers"> Array used to hold the partial list of sorted numbers.</param>
/// <param name="position"> Position in the array where the new number should be place.</param>
/// <param name="NewNumber"> Number to include in the array.</param>
static void ArrayShifter(int[] SortedNumbers, int position, int NewNumber)
{
for (int i = SortedNumbers.Length - 1; i > position; i--)
{
SortedNumbers[i] = SortedNumbers[i - 1];
}
SortedNumbers[position] = NewNumber;
}
The above version executed in ~ 0.65 milliseconds.
Version 2:
/// <summary>
/// Sequentially runs through all the values in the array and identifies if
/// the current number is less than the highest number in the sorted list.
/// </summary>
/// <param name="unsortedNumbers"> Unsorted numbers.</param>
/// <param name="lookbackCount"> Length of the array.</param>
/// <param name="sortedNumbers"> Array which will contain the partial list of sorted numbers.</param>
[Cudafy]
public static void CudaSorter(GThread thread, long[,] unsortedNumbers, int[] lookbackCount, long[,] sortedNumbers)
{
int threadIndex = thread.threadIdx.x;
int blockIndex = thread.blockIdx.x;
int threadsPerBlock = thread.blockDim.x;
int gpuThread = (threadIndex + (blockIndex * threadsPerBlock));
if (gpuThread < 32)
{
int maxIndex = (lookbackCount[gpuThread] * 10) / 100;
int maxLookback = lookbackCount[gpuThread];
for (int i = 0; i < maxLookback; i++)
{
if (sortedNumbers[gpuThread, maxIndex] > unsortedNumbers[gpuThread, i])
{
//Update numbers
IdentifyPosition2(sortedNumbers, unsortedNumbers[gpuThread, i], maxIndex, gpuThread);
}
}
}
}
/// <summary>
/// Identifies the position in the sortedNumbers array where the new number should be placed.
/// </summary>
/// <param name="sortedNumbers"> Sorted numbers.</param>
/// <param name="newNumber"> Number to be included in the sorted array.</param>
/// <param name="maxIndex"> length of sortedNumbers array. </param>
/// <param name="gpuThread"> GPU thread index.</param>
[Cudafy(eCudafyType.Device)]
public static void CudaIdentifyPosition(long[,] sortedNumbers, long newNumber, int maxIndex, int gpuThread)
{
for (int i = 0; i < maxIndex; i++)
{
if (newNumber < sortedNumbers[gpuThread, i])
{
//Offset and add.
ArrayShifter2(sortedNumbers, i, newNumber, maxIndex, gpuThread);
break;
}
}
}
/// <summary>
/// Shifts all the elements to the right of the specified position, 1 position
/// to the right, and insert the new number in the specified position.
/// </summary>
/// <param name="sortedNumbers"> Sorted Numbers.</param>
/// <param name="position"> Where the new number needs to be inserted.</param>
/// <param name="newNumber"> New number to insert.</param>
/// <param name="maxIndex"> Length of sortedNumbers array.</param>
/// <param name="gpuThread"> GPU thread index.</param>
[Cudafy(eCudafyType.Device)]
public static void CudaArrayShifter(long[,] sortedNumbers, int position, long newNumber, int maxIndex, int gpuThread)
{
for (int i = maxIndex - 1; i > position; i--)
{
sortedNumbers[gpuThread, i] = sortedNumbers[gpuThread, i - 1];
}
sortedNumbers[gpuThread, position] = newNumber;
}
The above executes in 2.8 milliseconds i.e. ~ 4x slower.
I've already tried the following:
- Declared local variable for
maxLookBack
count and used that in the for loop => no improvement. - Changed data types from
long[,]
toint[,]
=> 2.6 milliseconds (This isn't feasible as I need to use long.) - Changed
int[,]
toint[]
=> 1.3 milliseconds (This isn't feasible either as I need to pass multiple arrays to the GPU to keep it occupied.) I was surprised how much this affected the time.
EDIT: I modified the code due to Henk's comments. I now ran the GPU version on the GPU with unsortedNumbers[32,1900]
vs a single thread on the CPU sorting 1 array. Even when I multiply the CPU time by 32, it's still considerably lower than the GPU's time.