Halide with GPU (OpenGL) as Target - benchmarking and using HalideRuntimeOpenGL.h

Question

I am new to Halide. I have been playing around with the tutorials to get a feel for the language. Now, I am writing a small demo app to run from command line on OSX.

My goal is to perform a pixel-by-pixel operation on an image, schedule it on the GPU and measure the performance. I have tried a couple things which I want to share here and have a few questions about the next steps.

First approach

I scheduled the algorithm on the GPU with OpenGL as the Target. Because I could not access the GPU memory directly to write it to a file, I copied the output to the CPU inside the Halide routine by creating a Func cpu_out, similar to the glsl sample app in the Halide repo.

pixel_operation_cpu_out.cpp

#include "Halide.h"
#include <stdio.h>

using namespace Halide;

const int _number_of_channels = 4;

int main(int argc, char** argv)
{
    ImageParam input8(UInt(8), 3);

    input8
        .set_stride(0, _number_of_channels) // stride in dimension 0 (x) is _number_of_channels (four)
        .set_stride(2, 1); // stride in dimension 2 (c) is one

    Var x("x"), y("y"), c("c");

    // algorithm
    Func input;
    input(x, y, c) = cast<float>(input8(clamp(x, input8.left(), input8.right()),
                                 clamp(y, input8.top(), input8.bottom()),
                                 clamp(c, 0, _number_of_channels - 1))) / 255.0f; // clamp bounds are inclusive

    Func pixel_operation;

    // Calculate the value for input(x, y, c) after applying a
    // pixel-wise operation to each pixel. This gives us pixel_operation(x, y, c).
    // The operation does not depend on pixel location, e.g. brighten.

    Func out;
    out(x, y, c) = cast<uint8_t>(pixel_operation(x, y, c) * 255.0f + 0.5f);
    out.output_buffer()
        .set_stride(0, _number_of_channels)
        .set_stride(2, 1);
    input8.set_bounds(2, 0, _number_of_channels); // Dimension 2 (c) starts at 0 and has extent _number_of_channels.
    out.output_buffer().set_bounds(2, 0, _number_of_channels);

    // schedule

    out.compute_root();
    out.reorder(c, x, y)
        .bound(c, 0, _number_of_channels)
        .unroll(c);

    // Schedule for GLSL

    out.glsl(x, y, c);

    Target target = get_target_from_environment();
    target.set_feature(Target::OpenGL);

    // create a cpu_out Func to copy over the data in Func out from GPU to CPU
    std::vector<Argument> args = {input8};
    Func cpu_out;
    cpu_out(x, y, c) = out(x, y, c);
    cpu_out.output_buffer()
        .set_stride(0, _number_of_channels)
        .set_stride(2, 1);
    cpu_out.output_buffer().set_bounds(2, 0, _number_of_channels);
    cpu_out.compile_to_file("pixel_operation_cpu_out", args, target);

    return 0;
}

Since I compile this AOT, I make a function call in my main() for it. main() resides in another file.

main_file.cpp

Note: the Image class used here is the same as the one in this Halide sample app

int main()
{
    char *encoded_jpeg_input_buffer = read_from_jpeg_file("input_image.jpg");
    unsigned char *pixelsRGBA = decompress_jpeg(encoded_jpeg_input_buffer);

    Image input(width, height, channels, sizeof(uint8_t), Image::Interleaved);
    Image output(width, height, channels, sizeof(uint8_t), Image::Interleaved);
    input.buf.host = &pixelsRGBA[0];
    unsigned char *outputPixelsRGBA = (unsigned char *)malloc(sizeof(unsigned char) * width * height * channels);
    output.buf.host = &outputPixelsRGBA[0];

    double best = benchmark(100, 10, [&]() {
         pixel_operation_cpu_out(&input.buf, &output.buf);
    });

    char* encoded_jpeg_output_buffer = compress_jpeg(output.buf.host);
    write_to_jpeg_file("output_image.jpg", encoded_jpeg_output_buffer);
}

This works just fine and gives me the output I expect. From what I understand, cpu_out makes the values in out available on the CPU memory, which is why I am able to access these values by accessing output.buf.host in main_file.cpp
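
For reference, the stride settings used above (stride _number_of_channels in x, stride 1 in c) describe an interleaved RGBA layout. The following is a minimal sketch of the addressing this implies, assuming rows are tightly packed so that the stride in y is width * channels. The helper names interleaved_offset and read_pixel are hypothetical and are not part of Halide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of Halide): offset of element (x, y, c) in an
// interleaved buffer where stride(x) = channels, stride(c) = 1, and
// stride(y) = width * channels (assumes tightly packed rows).
inline std::size_t interleaved_offset(int x, int y, int c, int width, int channels) {
    assert(x >= 0 && x < width && y >= 0 && c >= 0 && c < channels);
    return (static_cast<std::size_t>(y) * width + x) * channels + c;
}

// Read one channel of one pixel from an interleaved 8-bit image.
inline std::uint8_t read_pixel(const std::uint8_t *pixels, int x, int y, int c,
                               int width, int channels) {
    return pixels[interleaved_offset(x, y, c, width, channels)];
}
```

With this layout, the four channel values of a pixel sit next to each other in memory, which is why output.buf.host can be handed straight to a JPEG encoder that expects RGBA bytes.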

Second approach:

In the second approach I did not copy from device to host inside the Halide pipeline (no Func cpu_out). Instead, I called the copy_to_host function from main_file.cpp.

pixel_operation_gpu_out.cpp

#include "Halide.h"
#include <stdio.h>

using namespace Halide;

const int _number_of_channels = 4;

int main(int argc, char** argv)
{
    ImageParam input8(UInt(8), 3);

    input8
        .set_stride(0, _number_of_channels) // stride in dimension 0 (x) is _number_of_channels (four)
        .set_stride(2, 1); // stride in dimension 2 (c) is one

    Var x("x"), y("y"), c("c");

    // algorithm
    Func input;
    input(x, y, c) = cast<float>(input8(clamp(x, input8.left(), input8.right()),
                                 clamp(y, input8.top(), input8.bottom()),
                                 clamp(c, 0, _number_of_channels - 1))) / 255.0f; // clamp bounds are inclusive

    Func pixel_operation;

    // Calculate the value for input(x, y, c) after applying a
    // pixel-wise operation to each pixel. This gives us pixel_operation(x, y, c).
    // The operation does not depend on pixel location, e.g. brighten.

    Func out;
    out(x, y, c) = cast<uint8_t>(pixel_operation(x, y, c) * 255.0f + 0.5f);
    out.output_buffer()
        .set_stride(0, _number_of_channels)
        .set_stride(2, 1);
    input8.set_bounds(2, 0, _number_of_channels); // Dimension 2 (c) starts at 0 and has extent _number_of_channels.
    out.output_buffer().set_bounds(2, 0, _number_of_channels);

    // schedule

    out.compute_root();
    out.reorder(c, x, y)
        .bound(c, 0, _number_of_channels)
        .unroll(c);

    // Schedule for GLSL

    out.glsl(x, y, c);

    Target target = get_target_from_environment();
    target.set_feature(Target::OpenGL);

    std::vector<Argument> args = {input8};
    out.compile_to_file("pixel_operation_gpu_out", args, target);

    return 0;
}

main_file.cpp

#include "pixel_operation_gpu_out.h"
#include "runtime/HalideRuntime.h"

int main()
{
    char *encoded_jpeg_input_buffer = read_from_jpeg_file("input_image.jpg");
    unsigned char *pixelsRGBA = decompress_jpeg(encoded_jpeg_input_buffer);

    Image input(width, height, channels, sizeof(uint8_t), Image::Interleaved);
    Image output(width, height, channels, sizeof(uint8_t), Image::Interleaved);
    input.buf.host = &pixelsRGBA[0];
    unsigned char *outputPixelsRGBA = (unsigned char *)malloc(sizeof(unsigned char) * width * height * channels);
    output.buf.host = &outputPixelsRGBA[0];

    double best = benchmark(100, 10, [&]() {
         pixel_operation_gpu_out(&input.buf, &output.buf);
    });

    int status = halide_copy_to_host(NULL, &output.buf);

    char* encoded_jpeg_output_buffer = compress_jpeg(output.buf.host);
    write_to_jpeg_file("output_image.jpg", encoded_jpeg_output_buffer);

    return 0;
}

So, now, what I think is happening is that pixel_operation_gpu_out is keeping output.buf on the GPU and when I do copy_to_host, that's when I get the memory copied over to the CPU. This program gives me the expected output as well.

Questions:

The second approach is much slower than the first, but the slow part is not inside the benchmarked region. For the first approach, I get a benchmarked time of 17ms for a 4k image. For the same image in the second approach, the benchmarked time is 22us and the copy_to_host call takes 10s. I'm not sure whether this behavior is expected, since both approaches are essentially doing the same thing.

The next thing I tried was to use HalideRuntimeOpenGL.h and link textures to the input and output buffers, so that I could draw directly to an OpenGL context from main_file.cpp instead of saving to a jpeg file. However, I could not find any examples showing how to use the functions in HalideRuntimeOpenGL.h, and everything I tried on my own gave me runtime errors that I could not figure out how to solve. If anyone can point me to any resources, that would be great.

Also, any feedback on the code above is welcome. I know it works and does what I want, but it could be completely the wrong way of doing it and I wouldn't know any better.

score 0 · Accepted Answer · answered Jun 13 '16 at 21:24


Most likely the 10s spent copying memory back is because the GPU API has queued all the kernel invocations and then waits for them to finish when halide_copy_to_host is called. You can call halide_device_sync inside the benchmark timing, after running all the compute calls, to get the compute time inside the loop without the copy-back time.

I cannot tell from this code how many times the kernel is being run. (My guess is 100, but it may be that those arguments to benchmark set up some sort of parameterization where it runs as many times as needed to reach significance. If so, that is a problem, because the queuing call is really fast while the compute is of course asynchronous. If this is the case, you can queue ten calls, then call halide_device_sync, and play with the number "10" to get a real picture of how long it takes.)
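
The "queue N calls, then sync once" idea can be sketched as a small timing helper. This is only an illustration under assumptions: time_batch is a hypothetical name, and the run and sync callables stand in for whatever enqueues the pipeline and whatever blocks until the device is done.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: invoke `run` `calls` times, then call `sync` once, and
// return the average wall-clock seconds per invocation. The sync is timed
// together with the calls so that queued (asynchronous) work is included.
template <typename Run, typename Sync>
double time_batch(int calls, Run run, Sync sync) {
    assert(calls > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < calls; ++i) {
        run();  // may only enqueue work and return immediately
    }
    sync();     // expected to block until the queued work has finished
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / calls;
}
```

In the question's setting, run would wrap the call to pixel_operation_gpu_out and sync would wrap halide_device_sync, or halide_copy_to_host if the sync call turns out not to block on the backend in use. Varying calls (for example 1, 10, 100) shows whether the per-call figure is stable.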


Zalman Stern


I tried using halide_device_sync as you mentioned. I'm not sure it's doing what it's supposed to (i.e. wait for the GPU to finish running all its tasks). I still see slow copy times with or without halide_device_sync. I also tried the queuing solution you suggested but didn't get much out of it; the results still seem equally cryptic. – user5597458 Jun 14 '16 at 21:50
The call to pixel_operation_gpu_out will always queue the request. Looks like halide_device_sync doesn't do anything for OpenGL. (There's a TODO. You could try uncommenting the glFinish(). I'm not sure why it is commented out.) How many times is pixel_operation_gpu_out being called? – Zalman Stern Jun 14 '16 at 23:41
I tried calling it as few as 5 times, or even just once. When I call it once, the copy time does go down, but it's still way higher than in the first approach: 200 to 300 ms. – user5597458 Jun 15 '16 at 00:54

Also, does your benchmark harness do a warm-up? There's an issue that this will also be asynchronous, and the first call will take a long time because it compiles the shader. I'd suggest the following method to assess the timing: 1) Outside of the benchmark loop, call pixel_operation_gpu_out once and then use halide_copy_to_host to force it to complete. 2) Do a series of benchmarks that call pixel_operation_gpu_out 1, 5, 10, 100 times (for some series) and then do halide_copy_to_host inside the loop. From this you should be able to establish the time to compute and the time to copy back. – Zalman Stern Jun 15 '16 at 19:49
Also, you can turn on the debug flag in the target to get output about the OpenGL calls being made. This is bad for timing as it slows things down, but it may give you some insight into what is taking the time. – Zalman Stern Jun 15 '16 at 19:50
Thank you so much. The method you suggested to assess the timing worked :D The speeds are much faster now, as was expected (~22ms, which seems to match up better with the 17ms in the first approach). – user5597458 Jun 16 '16 at 00:10

I just pushed a fix to halide_device_sync for OpenGL. – Zalman Stern Jun 16 '16 at 05:54
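
One way to turn the batch measurements suggested in the comments (1, 5, 10, 100 calls, each followed by a copy back) into a compute time and a copy time is a straight-line fit of total = per_call * calls + fixed. The following is a minimal sketch under that assumed linear cost model. The names CostEstimate and fit_costs are hypothetical, and the least-squares fit is an illustration, not something prescribed in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: estimated cost per pipeline call and fixed cost
// per batch (e.g. the copy back), in the same units as the input totals.
struct CostEstimate {
    double per_call;
    double fixed;
};

// Ordinary least-squares fit of totals[i] = per_call * calls[i] + fixed.
inline CostEstimate fit_costs(const std::vector<double> &calls,
                              const std::vector<double> &totals) {
    assert(calls.size() == totals.size() && calls.size() >= 2);
    const double n = static_cast<double>(calls.size());
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < calls.size(); ++i) {
        sx += calls[i];
        sy += totals[i];
        sxx += calls[i] * calls[i];
        sxy += calls[i] * totals[i];
    }
    const double denom = n * sxx - sx * sx;
    assert(denom != 0.0);  // needs at least two distinct batch sizes
    const double per_call = (n * sxy - sx * sy) / denom;
    const double fixed = (sy - per_call * sx) / n;
    return {per_call, fixed};
}
```

The inputs would be the batch sizes and the measured total time of each batch, taken after a warm-up call so that shader compilation is not included.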
