
I have a compute shader program that looks for the maximum value in a float array. It uses reduction (compare two values and save the bigger one to the output buffer). Now I am not quite sure how to run this program from Java code (using JogAmp). In the display() method I run the program once (each time with the halved array in the input SSBO, i.e. the result of the previous iteration) and stop when the result array has only one item: the maximum.

Is this the correct approach? Creating and binding the input and output SSBOs, running the shader program, and then checking how many items were returned, every time in the display() method?

Java code:

    // upload the input data into the SSBO bound at binding point 1
    FloatBuffer inBuffer = Buffers.newDirectFloatBuffer(array);
    gl.glBindBuffer(GL3ES3.GL_SHADER_STORAGE_BUFFER, buffersNames.get(1));
    gl.glBufferData(GL3ES3.GL_SHADER_STORAGE_BUFFER, itemsCount * Buffers.SIZEOF_FLOAT, inBuffer,
            GL3ES3.GL_STREAM_DRAW);
    gl.glBindBufferBase(GL3ES3.GL_SHADER_STORAGE_BUFFER, 1, buffersNames.get(1));

    // run one reduction pass
    gl.glDispatchComputeGroupSizeARB(groupsCount, 1, 1, groupSize, 1, 1);

    // make the shader writes visible before reading the buffer back
    gl.glMemoryBarrier(GL3ES3.GL_SHADER_STORAGE_BARRIER_BIT);

    // map the buffer to read the result (unmap with glUnmapNamedBuffer when done)
    ByteBuffer output = gl.glMapNamedBuffer(buffersNames.get(1), GL3ES3.GL_READ_ONLY);
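Instead of one pass per display() call, the number of dispatches can be computed up front and the whole loop run in one frame. A minimal pure-Java sketch of that schedule, with no GL calls (the class and method names are made up), assuming each pass folds one vec4 (4 floats) down to a single float as the shader above does:

```java
public class ReductionSchedule {
    // Number of dispatches needed until one value remains, assuming each
    // pass shrinks the element count by a factor of 4 (one vec4 in, one
    // float out per invocation).
    public static int reductionPasses(int itemsCount) {
        int passes = 0;
        while (itemsCount > 1) {
            itemsCount = (itemsCount + 3) / 4; // ceiling division by 4
            passes++;
        }
        return passes;
    }
}
```

With this, the host loop can simply call glDispatchComputeGroupSizeARB a fixed number of times (with a glMemoryBarrier between passes) and map the buffer only once, after the last pass.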

Shader code:

    #version 430
    #extension GL_ARB_compute_variable_group_size : enable
    layout (local_size_variable) in;

    layout(std430, binding = 1) buffer MyData {
        vec4 elements[];
    } data;

    void main() {
        uint index = gl_GlobalInvocationID.x;

        float n1 = data.elements[index].x;
        float n2 = data.elements[index].y;
        float n3 = data.elements[index].z;
        float n4 = data.elements[index].w;

        // fold the four components of this vec4 down to a single maximum in .x
        data.elements[index].x = max(max(n1, n2), max(n3, n4));
    }
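For reference, each invocation above just takes the largest of one vec4's four components. A CPU-side equivalent in plain Java (useful for checking the shader's output against a known-good result; the class name is hypothetical):

```java
public class MaxReference {
    // Same fold as the shader: maximum of the four components of one vec4.
    public static float foldVec4(float x, float y, float z, float w) {
        return Math.max(Math.max(x, y), Math.max(z, w));
    }

    // One full reduction pass over an array laid out as consecutive vec4s
    // (length must be a multiple of 4): element i of the result is the
    // maximum of floats [4i .. 4i+3].
    public static float[] reducePass(float[] data) {
        float[] out = new float[data.length / 4];
        for (int i = 0; i < out.length; i++) {
            out[i] = foldVec4(data[4 * i], data[4 * i + 1],
                              data[4 * i + 2], data[4 * i + 3]);
        }
        return out;
    }
}
```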
Artholl
  • So, a few comments: you may not want this in a display function. You can just do a loop for however many times needed. By running it in the display function you are most likely going to be syncing with the buffer swap, so each iteration will be spaced 1/60th of a second apart. Also, if you know the original size, then you should just call it a fixed number of times; checking the number of items will most likely be an expensive call. – Luple May 20 '18 at 23:17
  • Thanks for the comments. That makes sense. Is there also some way to decrease the number of SSBO bindings? Because I don't need the intermediate results, so they don't need to leave the GPU at all; I need only the final result. I think it should be faster to calculate everything on the GPU at once. Is it possible to somehow directly pipeline several compute shader iterations without moving data back and forth? – Artholl May 21 '18 at 09:14
  • Okay, so my compute shader experience is nonexistent, but I have done similar things with fragment shaders (since I was using an environment without compute shaders), so this may not be the best answer. Basically, one of the most expensive operations is moving data between GPU and CPU memory. Once you upload the data and render to a texture (an SSBO, I believe, is the same thing, just with some fancy bells and whistles), you can use that output as input without moving it to the CPU at all. That being said, I think compute shaders have features such as local/shared memory, like OpenCL does. – Luple May 21 '18 at 16:06
  • If this is the case, then you can actually do the entire solve with one GPU call and just do the loop inside that call. You may want to look at an OpenCL equivalent and see what memory features they are taking advantage of and whether compute shaders support them. They might, since I know OpenGL has come a long way, and I am not too sure about all of the features. – Luple May 21 '18 at 16:07
  • I agree about which operation is the most expensive; removing that communication is roughly what I was asking about. I was able to find only the `shared` variable, which is shared within a work group; I could not find anything that works the way I wanted. Also, how could I set the number of work groups (and their sizes) if there is only one call for the whole loop? – Artholl May 21 '18 at 18:24
  • So now this is getting into compute shader knowledge that I am unfamiliar with. I believe that when dealing with this same problem in something like OpenCL, the work group size is a decision you would have to make. By the nature of how GPU programs function, I believe it is cheaper to take the performance loss of a consistent work group size than to adjust it and recall the function each time. So let's say you have 100 items and a work group size of 10: 10 * n = 100, so 10 work groups. – Luple May 21 '18 at 21:11
  • Each of those 10 finds a maximum, and then you can do the last bit on the CPU or pass it back in to do it one last time. Let's say you have 1,000,000,000 elements and the work group size is 10,000; you will end up with 100,000 maximums, which can be used as input for the next call, resulting in 10 maximums. This reduces a huge job down to just two calls. I am no expert in the field; this is my understanding, so you may want to get insight from someone with more experience. It also wouldn't be too hard to profile the performance and see how you do. – Luple May 21 '18 at 21:11
  • My experiments with switching input and output buffers were not successful, so for now I have at least switched to using just one buffer for both input and output. It is not an ideal solution, but it is faster than my previous one and it consumes less memory. – Artholl May 25 '18 at 06:52
  • Oh wait, this doesn't seem right. I would have to look at your code, but from my OpenGL experience this doesn't seem right. Maybe compute shaders are different, but it still seems strange: input and output buffers are not supposed to be the same, unless this changed in newer OpenGL versions. – Luple May 25 '18 at 06:58
  • I added simplified code. I don't think it is wrong, since you can read from this type of buffer and also write to it, and I change data only at the *current position*. – Artholl May 25 '18 at 07:23
  • Yeah, looking at your code, this is definitely all stuff I am unfamiliar with; I haven't worked with compute shaders in this way, just used the old-school fragment shader tricks. If it works, then I am sure it's not an issue. – Luple May 25 '18 at 16:05
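The single-buffer scheme from the last few comments can be illustrated without any GL at all: each pass reads a shrinking prefix of the same array and writes its maxima back to the front, so nothing leaves "the buffer" until the loop finishes. A pure-Java sketch of that logic (a CPU stand-in for the repeated dispatches, not actual JOGL code; on the GPU a glMemoryBarrier would still be needed between passes):

```java
public class InPlaceReduction {
    // Repeatedly fold groups of up to 4 values into the front of the same
    // array, mimicking the one-SSBO-for-input-and-output scheme: the host
    // loop would rebind nothing, only shrink the active item count.
    public static float findMax(float[] buffer) {
        int count = buffer.length;
        while (count > 1) {
            int next = (count + 3) / 4;            // groups of up to 4
            for (int i = 0; i < next; i++) {
                float m = buffer[4 * i];
                for (int j = 1; j < 4 && 4 * i + j < count; j++) {
                    m = Math.max(m, buffer[4 * i + j]);
                }
                buffer[i] = m;                     // write back to the front
            }
            count = next;                          // shrunken input for next pass
        }
        return buffer[0];
    }
}
```

The in-place write is safe because element i is only written after positions 4i and above have been read, so no pending input is overwritten.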

0 Answers