Algorithm for efficient memory management
You're asking for the best setup for OpenCL memory. I'm assuming you mostly care about high performance and can accept a moderate size overhead.
This means you should perform as many operations as possible on the GPU. Syncing between CPU/GPU should be minimized.
Memory model
I will now describe in detail what such a memory and processing model should look like.
- Preallocate buffers with the maximum size and fill them over time.
- Track how many elements are currently in the buffer.
- Have separate buffers for validity and your data. The validity buffer indicates the validity for each element
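To make the layout concrete, here is a minimal CPU-side sketch in Python. It is not OpenCL code: the lists stand in for device buffers, and the names ModelBuffers, data, valid and count are my own placeholders, not anything from an API.

```python
class ModelBuffers:
    """CPU stand-in for the preallocated device buffers (hypothetical names)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [0.0] * capacity   # element payload, allocated once at maximum size
        self.valid = [0] * capacity    # separate validity buffer: 1 = valid, 0 = invalid/unused
        self.count = 0                 # number of slots used so far (valid or not)

    def free_slots(self):
        """Remaining room behind the last used slot."""
        return self.capacity - self.count
```

Keeping validity in its own buffer means a kernel that only invalidates elements never has to touch the payload.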
Adding elements
Adding elements can be done via the following principle:
- Have a buffer with host pointer for input data. The size of the buffer is determined by the maximum number of input elements
- When you receive data, copy it onto the host buffer and sync it to the GPU
- (Optional) Preprocess input data on the GPU
- In a kernel, add input data and corresponding validity behind the last element in the global buffer. If some input slots are empty (maybe you just got 100 input points instead of 10000), just mark them as invalid.
This has several effects:
- Adding can be completely done in parallel
- You only have to sync a small buffer (input data buffer) to the GPU
- When adding input data, you always add the maximum number of input elements to the buffer, but most of them will be empty/invalid. So when you frequently add points, the global buffer fills up quickly with invalid elements.
- If your rendering step is not able to discard invalid points, you must remove invalid points from the model before rendering.
Otherwise, you can postpone the cleanup until it is actually needed, that is, when the model becomes too big and threatens to overflow the buffer.
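The add step described above can be sketched on the CPU like this. It is only a Python stand-in: each loop iteration plays the role of one work-item, and add_batch, max_inputs and the overflow check are my own assumptions for illustration.

```python
def add_batch(data, valid, count, inputs, max_inputs):
    """Append one fixed-size input batch behind the last used slot.

    Assumes len(inputs) <= max_inputs. Returns the new element count.
    """
    if count + max_inputs > len(data):
        raise OverflowError("global buffer full, remove invalid elements first")
    for i in range(max_inputs):        # i plays the role of get_global_id(0)
        has_input = i < len(inputs)    # empty input slots are written as invalid
        data[count + i] = inputs[i] if has_input else 0.0
        valid[count + i] = 1 if has_input else 0
    return count + max_inputs
```

Every iteration writes to its own slot, so nothing here needs synchronization between work-items. The price is that the count always grows by max_inputs, even when only a few inputs arrived.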
Removing elements
Removing elements should be done via the following principle:
- Have a kernel that determines if an element becomes invalid. If so, just mark its validity accordingly (if you want you can zero or NaN out the data, too, but that is not necessary).
- Have an algorithm that is able to remove invalid elements from the buffer and give you the number of valid, consecutive elements in the buffer (that information is needed when adding elements).
Such an algorithm will require you to perform sorts and a search using parallel reduction.
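A rough CPU sketch of these two steps, in Python. Python's built-in sorted stands in for whatever GPU sort you end up using, and mark_invalid, compact_by_sort and the should_remove predicate are hypothetical names.

```python
def mark_invalid(data, valid, count, should_remove):
    """Per-element pass: flag elements that became invalid, leave the data alone."""
    for i in range(count):             # one work-item per element
        if valid[i] and should_remove(data[i]):
            valid[i] = 0


def compact_by_sort(data, valid, count):
    """Sort the used slots so that valid elements come first.

    Returns the number of valid, consecutive elements at the front.
    """
    # Key 0 for valid, 1 for invalid; sorted() is stable, so valid
    # elements keep their relative order.
    order = sorted(range(count), key=lambda i: 1 - valid[i])
    data[:count] = [data[i] for i in order]
    valid[:count] = [valid[i] for i in order]
    return sum(valid[:count])
```

The returned number becomes the new element count, so the next add step starts writing directly behind the last valid element.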
Sorting elements in parallel
Sorting a buffer, especially one with many elements, is highly demanding. You should use available implementations to do so.
Simple Bitonic sort:
If you do not care about the maximum possible performance and want simple code, this is your choice.
- Implementation available: https://software.intel.com/en-us/articles/bitonic-sorting
- Simple to integrate, just a single kernel.
- Can only sort 4*2^n elements (as far as I remember).
- WARNING: This sort does not work with numbers larger than one billion (1,000,000,000). Not sure why but finding that out cost me quite some time.
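To show why a bitonic sort fits into a single simple kernel, here is the textbook bitonic network as a Python sketch. This is not the Intel sample linked above; it accepts any power-of-two length, and the function name is my own. On a GPU, the inner loop over i would be the work-items of one kernel launch per (k, j) pass.

```python
def bitonic_sort(a):
    """Sort list a in place, ascending. len(a) must be a power of two."""
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:                      # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:                   # compare distance within this merge step
            for i in range(n):         # independent compare-and-swap operations
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    out_of_order = a[i] > a[partner] if ascending else a[i] < a[partner]
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

The compare pattern depends only on the indices, never on the data, which is what makes the algorithm so easy to run in parallel.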
Fast radix sort:
If you care about maximum performance and have lots of elements to sort (1 million up to 1 billion or even more), this is your choice.
Finding out the number of valid elements
If the buffer has been sorted so that all valid elements come first, you can either count the valid flags in parallel or find the index of the first invalid element (this requires the unused buffer space to be marked invalid). Both ways will give you the number of valid elements.
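Both options can be sketched on the CPU in Python. The first function mimics a tree-shaped parallel reduction, where the additions within one pass are independent. The second uses a binary search, which is my own choice for locating the first invalid element and only works if all valid flags precede all invalid ones. The function names are placeholders.

```python
def count_valid_reduction(valid):
    """Sum the validity flags with a tree reduction."""
    sums = list(valid)
    n = len(sums)
    stride = 1
    while stride < n:
        # All additions in this pass touch disjoint pairs, so they could run in parallel.
        for i in range(0, n - stride, 2 * stride):
            sums[i] += sums[i + stride]
        stride *= 2
    return sums[0] if sums else 0


def first_invalid_index(valid):
    """Index of the first 0 flag; assumes flags look like 1,1,...,1,0,...,0."""
    lo, hi = 0, len(valid)
    while lo < hi:
        mid = (lo + hi) // 2
        if valid[mid]:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

The reduction also works on an unsorted validity buffer, while the search relies on the sorted layout.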
Problem size vs. sorting size restrictions
To overcome the problems that arise with only being able to sort a fixed number of elements, just pad out with values whose sorting behavior you know.
Example:
- You want to sort 10,000 integers with values between 0 and 10 million in ascending order.
- You can only sort 2^n elements
The closest you will get is 2^14 = 16384.
- Have a buffer for sorting with 2^14 elements
- Fill the buffer with the 10000 values to sort.
- Fill all remaining values of the buffer with a value you know will be sorted behind the 10,000 actually existing values.
Since you know your value range (0 to 10 million), you could pick 11 million as filling value.
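The padding example can be written down as a short Python sketch. pow2_only_sort is a hypothetical stand-in for a sort that only accepts 2^n elements, and the other names are mine as well.

```python
def next_pow2(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p


def pow2_only_sort(a):
    """Stand-in for a restricted GPU sort: rejects lengths that are not 2^n."""
    if len(a) == 0 or len(a) & (len(a) - 1):
        raise ValueError("length must be a power of two")
    a.sort()


def sort_with_padding(values, sentinel, fixed_size_sort=pow2_only_sort):
    """Pad with a sentinel that sorts behind every real value, sort, drop the padding."""
    size = next_pow2(len(values))
    padded = list(values) + [sentinel] * (size - len(values))
    fixed_size_sort(padded)
    return padded[:len(values)]    # sentinels all ended up at the back
```

For the example above, 10,000 values are padded to 16384 slots and 11 million serves as the sentinel. For a descending sort you would pick a sentinel below the value range instead.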
In-place sorting problem
In-place sorting and removing of elements is difficult (but possible) to implement. An easier solution is to determine the indices of consecutive valid elements and write them to a new buffer in that order and then swap buffers.
But this requires you to swap buffers or copy back, which costs both performance and space. Choose the lesser evil in your case.
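One common way to determine the target indices for the out-of-place variant is an exclusive prefix sum (scan) over the validity flags. That technique is my suggestion here, not something the steps above prescribe. A CPU sketch in Python, with a sequential scan where a GPU would use a parallel one, and with hypothetical function names:

```python
def exclusive_scan(flags):
    """Return (offsets, total): offsets[i] is the number of set flags before index i."""
    offsets = [0] * len(flags)
    running = 0
    for i, flag in enumerate(flags):
        offsets[i] = running
        running += flag
    return offsets, running


def compact_to_new_buffer(data, valid, count):
    """Copy the valid elements, in order, to the front of fresh buffers."""
    offsets, n_valid = exclusive_scan(valid[:count])
    new_data = [0.0] * len(data)       # second buffer: this is the space overhead
    new_valid = [0] * len(valid)       # unused space stays marked invalid
    for i in range(count):             # one work-item per element, no write conflicts
        if valid[i]:
            new_data[offsets[i]] = data[i]
            new_valid[offsets[i]] = 1
    return new_data, new_valid, n_valid
```

After the call you swap the buffer handles and use n_valid as the new element count, so nothing has to be copied back.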
More advice
- Only add wait-events if you are still not content with the performance. They will complicate your code and possibly introduce bugs (which won't even be your fault - there is a nasty bug with Nvidia cards and OpenCL where wait-events are not destroyed and memory leaks - this will slowly but surely cause problems).
- Be very careful with syncing/mapping buffers to the CPU too early, as this sync call will force all kernels using this buffer to finish.
- If adding elements rarely occurs, and your rendering step is able to discard invalid elements, you can postpone removing elements from the buffer until it is really needed (too many elements threaten to overflow your buffer).