
I have written a simple Halide program to compute the squares of the numbers from 0 to n, but it takes 22x longer on the GPU than on the CPU.

#include"stdafx.h"
#include "Halide.h"
#include <stdio.h>
using namespace Halide;
#include "HalideRuntimeOpenCL.h"

#define GPU_TILE 16
#define COMPUTE_SIZE 1024

Target find_gpu_target();

// Define some Vars to use.
Halide::Var x, y, xo, yo, xi, yi;


// We're going to want to schedule a pipeline in several ways, so we
// define the pipeline in a class so that we can recreate it several
// times with different schedules.
class MyPipeline {
public:
    Halide::Func f;

    MyPipeline() {
        f(x) = x * x;
    }

    // Now we define methods that give our pipeline several different
    // schedules.
    void schedule_for_cpu() {

        // JIT-compile the pipeline for the CPU.
        Target target = get_host_target();
        f.compile_jit(target);

    }

    // Now a schedule that uses CUDA or OpenCL.
    bool schedule_for_gpu() {
        Target target = find_gpu_target();
        if (!target.has_gpu_feature()) {
            return false;
        }

        // Schedule f on the GPU in 1D tiles of GPU_TILE (16) points each.
        f.gpu_tile(x, xo, xi, GPU_TILE);
        f.compile_jit(target);

        return true;
    }

    void test_performance() {
        // Test the performance of the scheduled MyPipeline.


        // Run the filter once to initialize any GPU runtime state.
        Halide::Buffer<int> result = f.realize(COMPUTE_SIZE);

        // Now take the best of 3 runs for timing.
        double best_time = 0.0;
        for (int i = 0; i < 3; i++) {

            clock_t t1 = clock();

            // Run the filter 100 times.
            for (int j = 0; j < 100; j++) {
                // Run it.
                Halide::Buffer<int> result = f.realize(COMPUTE_SIZE);
                // Force any GPU code to finish by copying the buffer back to the CPU.
                result.copy_to_host();
            }

            clock_t t2 = clock();

            // Average time per run, in milliseconds.
            double elapsed = (t2 - t1) * 1000.0 / CLOCKS_PER_SEC / 100;
            if (i == 0 || elapsed < best_time) {
                best_time = elapsed;
            }
        }

        printf("%1.4f milliseconds\n", best_time);  
    }
    bool test_correctness() {
        Halide::Buffer<int> result = f.realize(COMPUTE_SIZE);
        for (int i = 0; i < COMPUTE_SIZE; i++)
        {
            if (result(i) != i * i)
                return false;
        }
        return true;
    }
};

int main(int argc, char **argv) {

    MyPipeline p1;
    p1.schedule_for_cpu();
    printf("Running pipeline on CPU:\n");
    printf("Test Correctness of cpu scheduler: %d\n",p1.test_correctness());

    MyPipeline p2;
    bool has_gpu_target = p2.schedule_for_gpu();
    if (has_gpu_target) {
        printf("Running pipeline on GPU:\n");
        printf("Test Correctness of gpu scheduler: %d\n", p2.test_correctness());
    }


    printf("Testing performance on CPU:\n");
    p1.test_performance();

    if (has_gpu_target) {
        printf("Testing performance on GPU:\n");
        p2.test_performance();
    }

    return 0;
}


Target find_gpu_target() {
    // Start with a target suitable for the machine you're running this on.
    Target target = get_host_target();

    // Enable the CUDA feature so the pipeline targets the GPU.
    target.set_feature(Target::CUDA);
    // Uncomment to enable debugging and see the GPU API calls being made.
    //target.set_feature(Halide::Target::Debug);
    return target;
}

Output

Running pipeline on CPU:
Test Correctness of cpu scheduler: 1
Running pipeline on GPU:
Test Correctness of gpu scheduler: 1
Testing performance on CPU:
1.0000 milliseconds
Testing performance on GPU:
22.0000 milliseconds   

I have tried running the GPU schedule with the debug flag; the times recorded are below:

CUDA: halide_cuda_initialize_kernels: 1.303033e+00 ms
CUDA: halide_cuda_device_malloc: 1.070443e+00 ms
CUDA: halide_cuda_run: 5.184570e+00 ms
CUDA: halide_cuda_buffer_copy: 7.340180e-01 ms
CUDA: halide_cuda_device_free: 1.317381e+00 ms

Edit 1: Is it possible in Halide to initialize the GPU kernel and do the device malloc/free only once, and then reuse the kernel for different inputs?

1 Answer

This is likely bottlenecked on API overhead on the GPU. It runs only 1k points per iteration, which is nowhere near enough work to fill most GPUs, and it does only a single multiply and store per point. It then serializes each kernel launch against the copy back to the host. If you did the same thing in raw CUDA or OpenCL, it would still be far below peak performance.

To measure less API overhead and more raw compute, try running a more complex kernel over a larger domain, for a longer period of time, and potentially also invoking the kernel multiple times before copying the result back to the host.
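For example, here is a minimal sketch of that last suggestion, reusing the question's f and COMPUTE_SIZE and assuming f has already been scheduled and JIT-compiled for the GPU as above. Realizing into a single pre-allocated output buffer should let its device allocation be reused across calls (which also relates to Edit 1, and to the per-run device_malloc/device_free seen in the debug output), and copying back to the host only once per batch avoids serializing every launch against a transfer:

// Allocate the output buffer once and reuse it for every launch.
Halide::Buffer<int> out(COMPUTE_SIZE);

for (int j = 0; j < 100; j++) {
    // Enqueue the kernel, writing into the pre-allocated buffer; no copy back yet.
    f.realize(out);
}
// A single device-to-host transfer for the whole batch forces the GPU work to finish.
out.copy_to_host();

The kernel itself is compiled only once by compile_jit, so what this amortizes is the per-launch allocation and transfer overhead, not compilation.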

jrk
  • increased the COMPUTE_SIZE to 1048576, and the GPU processing time is still more than the CPU. Another algorithm I am working on is more complex and I was not getting good performance with it using Halide, hence I tried the simple test program above – Dhruvesh Gajaria Nov 21 '19 at 03:16