
I'm looking for the best way to set up the CL memory objects for my project, which runs a device-side physics simulation. The buffers will be accessed by the host every frame, approximately every 16 ms, to get the updated data for rendering. Unfortunately, I cannot send the new data straight to the GPU via a VBO.

The data in the buffer consists of structs with three cl_float4s and one cl_float. I also want the host to be able to update some of the structs in the buffer; this will not happen every frame.
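For reference, the struct looks roughly like this (member names here are placeholders; two of the float4s are read/write in the kernel, the rest read-only):

    #include <CL/cl.h>

    /* One simulation element; member names are placeholders. With three
       cl_float4 members and one cl_float, most implementations pad this
       to 64 bytes (cl_float4 requires 16-byte alignment). */
    typedef struct {
        cl_float4 position;   /* read/write in the kernel, used for rendering */
        cl_float4 velocity;   /* read/write in the kernel */
        cl_float4 properties; /* read-only in the kernel */
        cl_float  mass;       /* read-only in the kernel */
    } Particle;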

Currently I'm looking to have all the data allocated/stored on the GPU and to use map/unmap whenever the host requires access (see the sketch after the list below). But this brings up two issues that I can see:

  1. It still requires a device-to-host copy for rendering.
  2. The buffer must be rebuilt whenever objects are added to or removed from the simulation, or additional validation data must exist per struct to check whether the object is "alive"/valid.
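This is roughly the map/unmap pattern I have in mind (queue and buffer names are placeholders, error handling omitted):

    /* Per-frame readback sketch, using the Particle struct above.
       'queue' and 'simBuffer' stand in for my actual objects. */
    void read_back(cl_command_queue queue, cl_mem simBuffer, size_t elementCount)
    {
        cl_int err;
        Particle *mapped = (Particle *)clEnqueueMapBuffer(
            queue, simBuffer, CL_TRUE /* blocking */, CL_MAP_READ,
            0, elementCount * sizeof(Particle), 0, NULL, NULL, &err);

        /* ... hand the mapped region to the renderer ... */

        clEnqueueUnmapMemObject(queue, simBuffer, mapped, 0, NULL, NULL);
    }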

Any advice is appreciated. If you need any additional info or code snippets, just let me know.

Thank you.

  • Can you elaborate on why you "cannot send the new data straight to the GPU via a VBO"? Because CL/GL interoperability with a shared context seems to be what you need. – Baiz Sep 04 '15 at 11:08
  • @Baiz - Since this code will be in a library that will be independent of any rendering, I have no idea which API will be used for rendering. GL, DX11, DX12, etc... Nor do I wish to restrict it to one. – NIZGTR Sep 04 '15 at 14:15
  • I understand. Regarding the memory model, it depends on several things: * 16-byte alignment may help. * How is your data accessed in both kernel and rendering steps? If corresponding values (e.g. a single float4) from multiple structs are processed together, separate buffers may prove beneficial. By filling the VBO from CPU, you will have no problems with restructuring the data if rendering steps require a very specific data format. * Rebuilding the buffer sounds costly, you will have it easier just marking structs as old/invalid and then, in separate steps, reordering the buffer. – Baiz Sep 04 '15 at 15:11
  • * Also, how do you determine which elements are to be removed and is it important, where new elements are added? * Using asynchronous calls and wait-events could help with overall performance, but clutters the code and may introduce bugs (modern Nvidia cards actually produce a memory leak when using wait-events - not sure if that only happens with a shared context). – Baiz Sep 04 '15 at 15:30
  • @Baiz - In the kernel, 2 of the float4's are read from and written to, the other float4 and float are read only. Rendering only requires one of the float4's. Regarding marking structs as invalid; I completely agree. Elements are removed/invalidated by index and ordering of new elements is not important. Async calls are certainly something I'm considering. – NIZGTR Sep 04 '15 at 19:22
  • How many elements are added? Does it vary or do you always add x elements, but from these, y are invalid (and need not be added)? – Baiz Sep 04 '15 at 19:52
  • The number of elements will not change very often and will be constant between kernel calls. Basically the number of elements will not change on a per-frame basis. – NIZGTR Sep 04 '15 at 20:03
  • Okay, but when you add elements, how many is it? Does it vary and is there a maximum limit? Because that will determine if adding elements can be done in parallel in a kernel call. Also, when removing/invalidating elements, is this decided within a kernel? – Baiz Sep 04 '15 at 20:31
  • Is the decision of adding elements made on the CPU or in the kernel (where does the data come from)? E.g. if you receive new input data of x elements, is there a kernel that decides which of the x elements are added? Or is there even a correspondence search with existing elements? – Baiz Sep 04 '15 at 20:40
  • How many added varies, there is a defined maximum limit, adding and removing is decided on the CPU but not on a per-frame basis. Elements are guaranteed to be added/removed only at the end of each frame. Adds are usually in batches of elements but the size of the batch varies. – NIZGTR Sep 04 '15 at 20:55
  • Is there a specific reason why adding/removing is done on the CPU instead of the GPU? I can see various disadvantages: - the whole buffer needs CPU traversal - the whole buffer needs to be uploaded instead of only the few elements that are to be added – Baiz Sep 04 '15 at 21:00
  • Lets take this to chat if you can: https://chat.stackoverflow.com/rooms/88843/opencl-memory-talk – Baiz Sep 04 '15 at 21:04
  • I'm 2 rep short for chat access... By adding on the CPU; the new data gets copied to the CL buffer at the "free"/invalid slots. The buffer is never going to be fully rebuilt on adds/removes. – NIZGTR Sep 04 '15 at 21:25
  • Seems like the chat feature does not recognize your current rep – Baiz Sep 04 '15 at 21:34

1 Answer

Algorithm for efficient memory management

You're asking for the best setup for your OpenCL memory. I assume you mostly care about high performance and not so much about size overhead. This means you should perform as many operations as possible on the GPU; syncing between CPU and GPU should be minimized.

Memory model

I will now describe in detail what such a memory and processing model should look like (a host-side sketch follows the list).

  • Preallocate buffers with the maximum size and fill them over time.
  • Track how many elements are currently in the buffer.
  • Have separate buffers for your data and for validity. The validity buffer marks each element as valid or invalid.
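A minimal host-side sketch of this setup (the buffer capacity, the names and the Particle layout are assumptions, not taken from your code):

    #include <CL/cl.h>

    /* Element struct matching the layout from the question (placeholder names). */
    typedef struct { cl_float4 a, b, c; cl_float w; } Particle;

    #define MAX_ELEMENTS (1 << 20)  /* assumed maximum capacity */

    /* Preallocate the data and validity buffers once, at maximum size. */
    cl_mem create_buffers(cl_context context, cl_mem *validBufOut)
    {
        cl_int err;
        cl_mem dataBuf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                        MAX_ELEMENTS * sizeof(Particle), NULL, &err);
        *validBufOut   = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                        MAX_ELEMENTS * sizeof(cl_uchar), NULL, &err);
        return dataBuf;
    }

    /* Host-side counter of how many slots are currently in use. */
    size_t elementCount = 0;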

Adding elements

Adding elements can be done via the following principle:

  • Have a buffer with a host pointer for input data. The size of the buffer is determined by the maximum number of input elements.
  • When you receive data, copy it onto the host buffer and sync it to the GPU.
  • (Optional) Preprocess input data on the GPU.
  • In a kernel, append the input data and the corresponding validity flags behind the last element in the global buffer (see the kernel sketch after this list). Input slots that are empty (maybe you just got 100 input points instead of 10000) are simply marked as invalid.
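Such an append kernel could look like this (struct and argument names are placeholders; the host must ensure that count plus the batch size never exceeds the buffer capacity):

    // Element struct matching the host layout (placeholder names).
    typedef struct { float4 a, b, c; float w; } Particle;

    // Launched with a global size equal to the maximum input batch size.
    // 'count' is the number of valid elements already in the buffer;
    // work-items beyond 'inputCount' just mark their slot as invalid.
    __kernel void append_elements(__global Particle *data,
                                  __global uchar *valid,
                                  __global const Particle *input,
                                  const uint count,
                                  const uint inputCount)
    {
        uint i = get_global_id(0);
        data[count + i]  = input[i];  // input is allocated at max batch size
        valid[count + i] = (i < inputCount) ? 1 : 0;
    }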

This has several effects:

  • Adding can be completely done in parallel
  • You only have to sync a small buffer (input data buffer) to the GPU
  • When adding input data, you always write the maximum batch size into the buffer, but most of those slots may be empty/invalid. So if you frequently add points, invalid elements accumulate and the buffer fills up faster, making cleanup necessary sooner.
  • If your rendering step is not able to discard invalid points, you must remove invalid points from the model before rendering. Otherwise, you can postpone cleaning up until it is really needed, i.e. when the size of the model becomes too big and threatens to overflow the buffer.

Removing elements

Removing elements should be done via the following principle:

  • Have a kernel that determines if an element has become invalid. If so, just mark its validity accordingly (if you want, you can zero or NaN out the data too, but that is not necessary). A kernel sketch follows this list.
  • Have an algorithm that is able to remove invalid elements from the buffer and give you the number of valid, consecutive elements in the buffer (that information is needed when adding elements). Such an algorithm will require you to perform sorts and a search using parallel reduction.
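The marking step could look like this; the invalidation criterion shown here (w <= 0) is purely a placeholder:

    typedef struct { float4 a, b, c; float w; } Particle;  // placeholder names

    // Marks elements invalid based on some simulation-specific criterion.
    __kernel void mark_invalid(__global const Particle *data,
                               __global uchar *valid)
    {
        uint i = get_global_id(0);
        if (valid[i] && data[i].w <= 0.0f)  // placeholder criterion
            valid[i] = 0;
    }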

Sorting elements in parallel

Sorting a buffer, especially one with many elements, is highly demanding. You should use available implementations to do so.

Simple Bitonic sort:

If you do not need the maximum possible performance and want simple code, this is your choice.

  • Implementation available: https://software.intel.com/en-us/articles/bitonic-sorting
  • Simple to integrate, just a single kernel.
  • Can only sort 4*2^n elements (as far as I remember).
  • WARNING: This sort does not work with numbers larger than one billion (1,000,000,000). Not sure why but finding that out cost me quite some time.

Fast radix sort:

If you care about maximum performance and have lots of elements to sort (1 million up to 1 billion or even more), this is your choice.

Finding out the number of valid elements

If the buffer has been sorted so that all valid elements are consecutive, you can simply count the valid values in parallel, or find the index of the first invalid element (this requires the unused buffer space to be marked as invalid). Either way will give you the number of valid elements.
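For example, counting with a global atomic is simple to implement (a per-work-group reduction would be faster for very large buffers; the host must zero the counter before launching):

    // Counts set validity flags. 'counter' must be zeroed by the host first.
    __kernel void count_valid(__global const uchar *valid,
                              __global uint *counter)
    {
        if (valid[get_global_id(0)])
            atomic_inc(counter);
    }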

Problem size vs. sorting size restrictions

To overcome the problems that arise with only being able to sort a fixed number of elements, just pad out with values whose sorting behavior you know. Example:

  • You want to sort 10,000 integers with values between 0 and 10 million in ascending order.
  • You can only sort 2^n elements

The closest you will get is 2^14 = 16384.

  • Have a buffer for sorting with 2^14 elements
  • Fill the buffer with the 10,000 values to sort.
  • Fill all remaining values of the buffer with a value you know will be sorted behind the 10,000 actually existing values. Since you know your value range (0 to 10 million), you could pick 11 million as filling value.
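In code, the padding step could look like this (using the numbers from the example):

    #include <stdint.h>

    #define ACTUAL_COUNT 10000
    #define SORT_SIZE    16384     /* 2^14, smallest sortable size >= 10,000 */
    #define SENTINEL     11000000  /* sorts behind the known 0..10 million range */

    /* 'buf' has capacity SORT_SIZE; the first ACTUAL_COUNT entries hold
       the real values, the rest are filled with the sentinel. */
    void pad_for_sort(uint32_t *buf)
    {
        for (int i = ACTUAL_COUNT; i < SORT_SIZE; ++i)
            buf[i] = SENTINEL;
    }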

In-place sorting problem

In-place sorting and removal of elements is difficult (but possible) to implement. An easier solution is to determine the indices of the consecutive valid elements, write the elements to a new buffer in that order, and then swap the buffers (see the sketch below). This requires you to either swap buffers or copy back, which costs both performance and space; choose the lesser evil in your case.
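A sketch of that copy-out step, assuming a separate prefix-sum (scan) over the validity flags has already produced each valid element's target index:

    typedef struct { float4 a, b, c; float w; } Particle;  // placeholder names

    // 'scatterIdx[i]' is the exclusive prefix sum of the validity flags,
    // i.e. the output slot for element i if it is valid. Computing that
    // scan is a separate kernel/step not shown here.
    __kernel void compact_valid(__global const Particle *src,
                                __global const uchar *valid,
                                __global const uint *scatterIdx,
                                __global Particle *dst)
    {
        uint i = get_global_id(0);
        if (valid[i])
            dst[scatterIdx[i]] = src[i];
    }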

More advice

  • Only add wait-events if you are still not content with the performance. However, this will complicate your code and possibly introduce bugs (which won't even be your fault - there is a nasty bug with Nvidia cards and OpenCL where wait-events are not destroyed and memory leaks - this will slowly but surely cause problems).
  • Be very careful with syncing/mapping buffers to the CPU too early, as this sync call will force all kernels using the buffer to finish first.
  • If adding elements rarely occurs, and your rendering step is able to discard invalid elements, you can postpone removing elements from the buffer until it is really needed (too many elements threaten to overflow your buffer).
Baiz