Copying an image onto another with JOCL/OpenCL

Question

So my goal is to use the GPU for my brand new Java project, which is to create a game and the game engine itself (I think it is a very good way to learn in depth how it works).

I was using multi-threading on the CPU with java.awt.Graphics2D to display my game, but I have observed on other PCs that the game was running below 40 FPS, so I have decided to learn how to use the GPU (I will still be rendering all objects in a for loop and then drawing the image on screen).

For that reason, following the OpenCL documentation and the JOCL samples, I started to code a small, simple test which paints a texture onto the background image (let's assume that every entity has a texture).

This method is called in each render call and it is given the background, the texture, and the position of this entity as arguments.

Both code listings below have been updated to follow @ProjectPhysX's recommendations.

public static void XXX(final BufferedImage output_image, final BufferedImage input_image, float x, float y) {
        cl_image_format format = new cl_image_format();
        format.image_channel_order = CL_RGBA;
        format.image_channel_data_type = CL_UNSIGNED_INT8;

        //allocate the output image
        cl_image_desc output_description = new cl_image_desc();
        output_description.buffer = null; //must be null for 2D image
        output_description.image_depth = 0; //is only used if the image is a 3D image
        output_description.image_row_pitch = 0; //must be 0 if host_ptr is null
        output_description.image_slice_pitch = 0; //must be 0 if host_ptr is null
        output_description.num_mip_levels = 0; //must be 0
        output_description.num_samples = 0; //must be 0
        output_description.image_type = CL_MEM_OBJECT_IMAGE2D;
        output_description.image_width = output_image.getWidth();
        output_description.image_height = output_image.getHeight();
        output_description.image_array_size = output_description.image_width * output_description.image_height;

        cl_mem output_memory = clCreateImage(context, CL_MEM_WRITE_ONLY, format, output_description, null, null);
        
        //set up first kernel arg
        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(output_memory));
        
        //allocates input pointer
        cl_image_desc input_description = new cl_image_desc();
        input_description.buffer = null; //must be null for 2D image
        input_description.image_depth = 0; //is only used if the image is a 3D image
        input_description.image_row_pitch = 0; //must be 0 if host_ptr is null
        input_description.image_slice_pitch = 0; //must be 0 if host_ptr is null
        input_description.num_mip_levels = 0; //must be 0
        input_description.num_samples = 0; //must be 0
        input_description.image_type = CL_MEM_OBJECT_IMAGE2D;
        input_description.image_width = input_image.getWidth();
        input_description.image_height = input_image.getHeight();
        input_description.image_array_size = input_description.image_width * input_description.image_height;

        DataBufferInt input_buffer = (DataBufferInt) input_image.getRaster().getDataBuffer();
        int input_data[] = input_buffer.getData();

        cl_mem input_memory = clCreateImage(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, format, input_description, Pointer.to(input_data), null);

        //loads the input buffer to the gpu memory
        long[] input_origin = new long[] { 0, 0, 0 };
        long[] input_region = new long[] { input_image.getWidth(), input_image.getHeight(), 1 };
        int input_row_pitch = input_image.getWidth() * Sizeof.cl_uint; //the length of each row in bytes
        clEnqueueWriteImage(commandQueue, input_memory, CL_TRUE, input_origin, input_region, input_row_pitch, 0, Pointer.to(input_data), 0, null, null);
        
        //set up second kernel arg
        clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(input_memory));

        //set up third and fourth kernel args
        clSetKernelArg(kernel, 2, Sizeof.cl_float, Pointer.to(new float[] { x }));
        clSetKernelArg(kernel, 3, Sizeof.cl_float, Pointer.to(new float[] { y }));
        
        //blocks until all previously queued commands have completed
        clFinish(commandQueue);

        //enqueue the program execution
        long[] globalWorkSize = new long[] { input_description.image_width, input_description.image_height };
        clEnqueueNDRangeKernel(commandQueue, kernel, 2, null, globalWorkSize, null, 0, null, null);

        //transfer the output result back to host
        DataBufferInt output_buffer = (DataBufferInt) output_image.getRaster().getDataBuffer();
        int output_data[] = output_buffer.getData();
        long[] output_origin = new long[] { 0, 0, 0 };
        long[] output_region = new long[] { output_description.image_width, output_description.image_height, 1 };
        int output_row_pitch = output_image.getWidth() * Sizeof.cl_uint;
        clEnqueueReadImage(commandQueue, output_memory, CL_TRUE, output_origin, output_region, output_row_pitch, 0, Pointer.to(output_data), 0, null, null);

        //free pointers
        clReleaseMemObject(input_memory);
        clReleaseMemObject(output_memory);
    }

And here's the program source that is run as the kernel.

const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

__kernel void drawImage(__write_only image2d_t dst_image, __read_only image2d_t src_image, float xoff, float yoff)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);

    int2 in_coords = (int2) { x, y };

    uint4 pixel = read_imageui(src_image, sampler, in_coords);
    pixel = -16184301;
    printf("%d, %d, %u\n", x, y, pixel);

    const int sx = get_global_size(0);
    const int sy = get_global_size(1);

    int2 out_coords = (int2) { ((int) xoff + x) % sx, ((int) yoff + y) % sy};
    
    write_imageui(dst_image, out_coords, pixel);
}

Without the call to write_imageui, the background is painted black; otherwise it is white. At the moment, I am struggling to understand why pixel = 0 in the C function, but I think that someone familiar with JOCL would quickly find the error in this code. I am very confused by this code today, and maybe tomorrow too, and I don't feel like I will ever catch my mistake by myself. For that reason, I request your help in reviewing my code. I feel like an idiot for not being able to figure it out at this point.

ProjectPhysX · Answer 1 · 2021-02-18T19:24:17.700

1

Try

    const int sx = get_global_size(0);
    const int sy = get_global_size(1);
    int2 out_coords = (int2) { ((int) xoff + x) % sx, ((int) yoff + y) % sy };

to avoid errors or undefined behaviour. Right now you are writing into nirvana if the coordinate plus offset is outside the image region. Also, there is no clEnqueueWriteImage before the kernel is called, so src_image on the GPU is undefined and may contain random values.
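A minimal sketch of the wrap-around indexing, written in plain Java so it can be run without a GPU. The class and the helper name `wrap` are hypothetical, and the sketch assumes `size` is the dimension of the image being written to. It uses `Math.floorMod` so that negative offsets also land inside the image; Java's plain `%` operator would return a negative index in that case.

```java
/** Hypothetical helper illustrating wrap-around pixel coordinates. */
final class WrapCoords {

    /** Wraps (offset + coord) into the range [0, size), including for negative offsets. */
    static int wrap(int coord, float offset, int size) {
        // The float offset is truncated to an int first, as in the kernel's (int) cast.
        return Math.floorMod((int) offset + coord, size);
    }

    public static void main(String[] args) {
        System.out.println(wrap(5, 10.0f, 8)); // 15 wrapped into an 8-pixel-wide image
        System.out.println(wrap(0, -3.0f, 8)); // a negative offset still yields a valid index
    }
}
```

Whether wrapping is the right behaviour is a design choice: an alternative is to skip the write when the coordinate falls outside the destination, which clips the texture at the border instead of repeating it.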

OpenCL requires kernel parameters to be declared in global memory space:

__kernel void drawImage(global image2d_t dst_image, global image2d_t src_image, global float xoff, global float yoff)

Also, as someone who has written a graphics engine in Java and C++, GPU-parallelized in OpenCL, let me give you some guidance: In the Java code, you probably use the painter's algorithm: make a list of all drawn objects with their approximate z-coordinates, sort the objects by z-coordinate and draw them back-to-front in a single for-loop. On the GPU, the painter's algorithm won't work, as you cannot parallelize it. Instead, you have a list of objects (lines/triangles) in 3D space, and you parallelize over this list: each GPU thread rasterizes a single triangle, all threads at the same time, and draws the pixels on the frame at the same time. To solve the drawing order problem, you use a z-buffer: an image consisting of a z-coordinate per pixel. During rasterization of the line/triangle, you calculate the z-coordinate for every pixel, and only if it is larger than the one previously in the z-buffer at that pixel do you draw the new color.

Regarding performance: java.awt.Graphics2D is very efficient in terms of CPU usage; you can do ~40k triangles per frame at 60fps. With OpenCL, expect ~30M triangles per frame at 60fps.
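The z-buffer test described above can be sketched on the CPU in plain Java. This is only an illustration under stated assumptions: the class `ZBufferSketch` and its method `plot` are hypothetical names, colors are assumed to be packed ARGB ints, and the "larger z wins" convention is taken from the paragraph above.

```java
import java.util.Arrays;

/** Hypothetical z-buffer: a color is stored only if its depth wins the per-pixel test. */
final class ZBufferSketch {
    final int width, height;
    final int[] color;   // packed ARGB, one int per pixel
    final float[] depth; // z-coordinate of the fragment currently stored at each pixel

    ZBufferSketch(int width, int height) {
        this.width = width;
        this.height = height;
        this.color = new int[width * height];
        this.depth = new float[width * height];
        Arrays.fill(depth, Float.NEGATIVE_INFINITY); // nothing has been drawn yet
    }

    /** Draws one pixel and returns true if it passed the depth test. */
    boolean plot(int x, int y, float z, int argb) {
        if (x < 0 || y < 0 || x >= width || y >= height) return false; // outside the frame
        int i = y * width + x;
        if (z <= depth[i]) return false; // an equal or larger z is already stored here
        depth[i] = z;
        color[i] = argb;
        return true;
    }

    public static void main(String[] args) {
        ZBufferSketch frame = new ZBufferSketch(4, 4);
        frame.plot(1, 1, 2.0f, 0xFFFF0000); // red fragment with z = 2
        frame.plot(1, 1, 1.0f, 0xFF00FF00); // green fragment with z = 1 is rejected
        System.out.println(Integer.toHexString(frame.color[1 * 4 + 1])); // prints "ffff0000"
    }
}
```

Because of the depth test, the result does not depend on the order in which the fragments arrive, which is what removes the need for sorting. On a GPU, several threads may hit the same pixel at once, so the compare-and-write step would additionally need to be made atomic; that part is not shown here.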

edited Feb 18 '21 at 19:24

answered Feb 16 '21 at 21:41


Thanks a lot for the quick response @ProjectPhysX (and the advice on writing a game engine), I will test it later. I don't think that I have understood all of it, but you put me back on the right track. At the moment, I have a scene that maps z-index to a collection of objects (I must also have omitted that it will be a 2D engine only, so every entity is a rectangle for the moment). – Amaury Le Roux Dupeyron Feb 16 '21 at 22:57
**UPDATE** I've updated my code to match your suggestions. I added the C lines to avoid errors as you told me to, and a clEnqueueWriteImage call before clEnqueueNDRangeKernel, like: ` clEnqueueWriteImage(commandQueue, input, true, new long[] {0, 0, 0}, new long[] {texture.getWidth(), texture.getHeight(), 1}, 0, (long) 0, Pointer.to(inputData), 0, null, null); ` I still get the same behaviour: pixel = 0 even after pixel = -16184301, which is surprising to me. To be honest, my English is limiting my understanding more than my OpenCL skill. I need some clarification; it would be very helpful. – Amaury Le Roux Dupeyron Feb 17 '21 at 00:00
I just found another possible cause for the error: Are you missing `clCreateKernel` before the first call to `clSetKernelArg`? If `kernel` is not initialized before `clSetKernelArg`, it won't work. – ProjectPhysX Feb 17 '21 at 09:33
I created the kernel once, in a method above that is called when I initialize my application. Is that not enough to call it multiple times afterwards? You are right in fact: all the args equal zero, so there might be something wrong here. – Amaury Le Roux Dupeyron Feb 17 '21 at 10:51
Ok that's fine then. Once the `kernel` is created and kernel arguments are linked, you can call it as many times as required. But all arguments being zero is strange nevertheless. I can't make out the definitive cause yet. – ProjectPhysX Feb 17 '21 at 21:38
I have updated the source code in my post to match your responses; can you take a look one last time? – Amaury Le Roux Dupeyron Feb 18 '21 at 00:13
I see what's wrong, or at least part of it. `pixel` is type `uint4`. You cannot print `pixel` as `uint`, but only its components `pixel.x`, `pixel.y`, `pixel.z`, `pixel.w`. You also cannot write a signed integer to `pixel`, but only unsigned integers to the components. Also keep in mind the range of the colors. For example: `pixel.x = 255;` will set the red component to its maximum value. – ProjectPhysX Feb 18 '21 at 08:23
You got it right, pixel.x gives a value for the red in the range 0-255! That helps me a lot. But xoff and yoff are still equal to zero for no reason that is obvious to me; I'll try to debug this this afternoon. – Amaury Le Roux Dupeyron Feb 18 '21 at 10:35
In your kernel parameters, you are missing the global keyword for each parameter. Change it to `global image2d_t dst_image, global image2d_t src_image, global float xoff, global float yoff`. – ProjectPhysX Feb 18 '21 at 19:21
I have never seen global args in the JOCL samples, nor clEnqueueWriteImage, but that makes sense. In SimpleImage it is like this: __kernel void rotateImage(__read_only image2d_t sourceImage, __write_only image2d_t targetImage, float angle). Plus, I get an error when I write global in my project... – Amaury Le Roux Dupeyron Feb 19 '21 at 21:21
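The point raised in the comments, that a pixel is four separate components in the range 0-255 rather than one integer, can be sketched in plain Java. This assumes that the literal -16184301 from the question is a packed 0xAARRGGBB value, as returned by `BufferedImage.getRGB`; the class `ArgbComponents` and its method `unpack` are hypothetical names.

```java
/** Hypothetical helper that splits a packed ARGB int into its four components. */
final class ArgbComponents {

    /** Returns {red, green, blue, alpha}, each in the range 0..255. */
    static int[] unpack(int argb) {
        return new int[] {
            (argb >>> 16) & 0xFF, // red
            (argb >>> 8) & 0xFF,  // green
            argb & 0xFF,          // blue
            (argb >>> 24) & 0xFF  // alpha
        };
    }

    public static void main(String[] args) {
        int[] c = unpack(-16184301); // same bits as 0xFF090C13
        System.out.printf("r=%d g=%d b=%d a=%d%n", c[0], c[1], c[2], c[3]);
        // prints "r=9 g=12 b=19 a=255"
    }
}
```

In the kernel, the corresponding step would be to assign each value to its own component of the `uint4` instead of assigning the packed integer to the whole vector. Which component ends up holding which channel depends on the image format and the host byte order, so that mapping is worth checking with a known test color.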
