
OpenCL programs/kernels get built/compiled at runtime using the clBuildProgram() function. My program dynamically creates the kernels to build and, as a result, spends a considerable amount of time compiling them. Since there are many kernels and they are completely independent of each other, I would like to split this work over multiple cores.

This person seems to have had a very similar problem, but that was 6 years ago and the solution is not really satisfactory in my opinion.

The snippet below shows what I tried:

ThreadPool tempPool;
auto start = std::chrono::steady_clock::now();

for (int reps = 0; reps < 50; reps++) {
    tempPool.addJob([this] () {
        auto start = std::chrono::steady_clock::now();

        //These would hold the program sources
        std::vector<const char*> sources = {sourceCode.toRawUTF8()};
        // Element type must be non-const: std::vector<const T> is not valid with the default allocator
        std::vector<size_t> sourceLengths = {sourceCode.getNumBytesAsUTF8()};

        cl_int ret;
        cl_program program = clCreateProgramWithSource(getCLContext()(), 1, sources.data(), sourceLengths.data(), &ret);

        // Build the program
        ret = clBuildProgram(program, 1, &getCLDevices()[0](), NULL, NULL, NULL);
        if (ret) {
            //Generic error checking
        }

        auto singleDuration = std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();
    });
}

//Simple way to wait for all jobs to be finished
while (tempPool.getNumJobs() > 0) {
    Thread::sleep(1);
}

auto totalDuration = std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();

Everything else I run through this ThreadPool setup gets a speedup of 5-6x (I have 8 threads), which is to be expected. Building OpenCL kernels, however, shows no such speedup. It seems as if only one kernel can be built at a time.

Is there a solution to this? I'm on macOS at the moment, but I would also be interested in Linux/Windows.

If not, is there a way to build OpenCL kernels that does not involve clBuildProgram(), but uses gcc or a similar tool instead?

me me

1 Answer


(I am surprised that the driver for your platform isn't already multithreaded. Are you sure your calls are really running in parallel?)
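One way to check this is to time the same job serially and then on concurrent threads, while recording how many jobs overlap. The sketch below is only an illustration using plain std::thread instead of the ThreadPool from the question. `checkParallelism` and `ParallelCheck` are hypothetical names, and `job` stands in for the lambda that calls clCreateProgramWithSource and clBuildProgram.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Result of the check (hypothetical helper type, not part of OpenCL or JUCE).
struct ParallelCheck {
    double serialMs;     // wall time for running all jobs one after another
    double parallelMs;   // wall time for running all jobs on separate threads
    int peakConcurrent;  // highest number of jobs that were in flight at once
};

// Runs `job` numJobs times serially, then numJobs times on separate threads.
ParallelCheck checkParallelism(const std::function<void()>& job, int numJobs) {
    using Clock = std::chrono::steady_clock;
    auto elapsedMs = [](Clock::time_point since) {
        return std::chrono::duration<double, std::milli>(Clock::now() - since).count();
    };

    // Serial baseline.
    auto serialStart = Clock::now();
    for (int i = 0; i < numJobs; ++i) job();
    double serialMs = elapsedMs(serialStart);

    // Threaded run, tracking how many jobs overlap.
    std::atomic<int> active{0};
    std::atomic<int> peak{0};
    std::vector<std::thread> threads;
    auto parallelStart = Clock::now();
    for (int i = 0; i < numJobs; ++i) {
        threads.emplace_back([&] {
            int now = ++active;
            int seen = peak.load();
            // Raise the recorded peak if this thread observed a higher count.
            while (now > seen && !peak.compare_exchange_weak(seen, now)) {}
            job();
            --active;
        });
    }
    for (auto& t : threads) t.join();
    double parallelMs = elapsedMs(parallelStart);

    return {serialMs, parallelMs, peak.load()};
}
```

If peakConcurrent is close to the number of jobs but parallelMs stays close to serialMs, the calls are being issued concurrently and something inside the call is serialising them. If peakConcurrent stays at 1, the problem is more likely in the thread pool setup.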

If you're still stuck, here is a wretched hack that might work; it extends the solution in the question you referenced. For some drivers, clCreateProgramWithBinary is much faster than building from source. Hence:

  1. fork new processes (or call a helper executable that uses the same device set)
  2. each subprocess calls clCreateProgramWithSource and then clBuildProgram
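  3. the children call clGetProgramInfo(...CL_PROGRAM_BINARIES...) to fetch the binary and then pass it back via a file, a pipe, or some other form of interprocess communication
  4. the parent loads each binary with clCreateProgramWithBinary and then calls clBuildProgram on it, which is the step that should be much faster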

Again, I'd double-check your setup code first before duct-taping this hack together.
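The process plumbing for the steps above could look roughly like the following POSIX-only sketch (so macOS and Linux, not Windows). It is an illustration under assumptions, not a tested OpenCL solution: `buildToBinary` is a hypothetical stub that stands in for the child-side clCreateProgramWithSource, clBuildProgram, and clGetProgramInfo calls, and `buildInChildren` and `ChildProc` are made-up names. The stub just echoes the source bytes so the sketch runs without an OpenCL installation.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the real child-side work (build from source, then
// fetch the binary with CL_PROGRAM_BINARIES). It only copies the source bytes.
std::vector<unsigned char> buildToBinary(const std::string& source) {
    return std::vector<unsigned char>(source.begin(), source.end());
}

struct ChildProc {
    pid_t pid;   // -1 if the child could not be started
    int readFd;  // read end of the pipe the child writes its binary to
};

// Forks one child per source and collects each child's binary over a pipe.
// Returns one entry per source; an entry is empty if that child failed.
std::vector<std::vector<unsigned char>> buildInChildren(const std::vector<std::string>& sources) {
    std::vector<ChildProc> children;
    for (const auto& src : sources) {
        int fds[2];
        if (pipe(fds) != 0) { children.push_back({-1, -1}); continue; }
        pid_t pid = fork();
        if (pid == 0) {
            // Child: drop inherited read ends, build, write the binary, exit.
            close(fds[0]);
            for (const auto& c : children) if (c.readFd >= 0) close(c.readFd);
            std::vector<unsigned char> binary = buildToBinary(src);
            size_t off = 0;
            while (off < binary.size()) {
                ssize_t n = write(fds[1], binary.data() + off, binary.size() - off);
                if (n <= 0) _exit(1);
                off += static_cast<size_t>(n);
            }
            close(fds[1]);
            _exit(0);
        }
        // Parent: keep only the read end, so EOF arrives when the child is done.
        close(fds[1]);
        if (pid < 0) { close(fds[0]); children.push_back({-1, -1}); continue; }
        children.push_back({pid, fds[0]});
    }
    assert(children.size() == sources.size());

    std::vector<std::vector<unsigned char>> binaries(sources.size());
    for (size_t i = 0; i < children.size(); ++i) {
        if (children[i].pid < 0) continue;
        unsigned char buf[4096];
        ssize_t n;
        while ((n = read(children[i].readFd, buf, sizeof buf)) > 0)
            binaries[i].insert(binaries[i].end(), buf, buf + n);
        close(children[i].readFd);
        int status = 0;
        waitpid(children[i].pid, &status, 0);
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) binaries[i].clear();
    }
    return binaries;
}
```

In the parent, each returned buffer would then go to clCreateProgramWithBinary. Forking a process that has already started threads or initialised OpenCL may be unsafe with some drivers, so launching a separate helper executable, as mentioned in step 1, is probably the more robust variant.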

Tim
  • Tbh I was surprised at that as well. Apple seems to have given up on OpenCL quite a while ago (it is even deprecated in the newer versions), so that might be one of the reasons. As such I'm also curious if Linux/Windows implementations suffer from the same problem, but I'm not able to test that atm. – me me Sep 25 '19 at 07:32
  • At least last time I tried it, Xcode actually had [built-in OpenCL precompilation support](https://developer.apple.com/library/archive/documentation/Performance/Conceptual/OpenCL_MacProgGuide/XCodeHelloWorld/XCodeHelloWorld.html) which takes care of some of the awkwardness of dealing with OpenCL kernel binaries. This isn't portable to other platforms, but it sounds like you don't really care about that. – pmdj Sep 25 '19 at 08:48
  • If you only need to target macOS 10.14 and newer and only care about running on GPUs, however, I strongly recommend switching to Metal compute kernels as those are much better supported. (macOS 10.14 dropped support for GPUs which don't support Metal, so using OpenCL doesn't get you any extra compatibility) – pmdj Sep 25 '19 at 08:49
  • Precompilation support wouldn't help me as all the kernels get created dynamically at runtime, based on (among other things) the user's input. Regarding Metal, I've also looked into that option but it is unfortunately insufficient atm (there is no FP64 support for example). OpenCL also leaves me the option to port the program with relative ease in the future. – me me Sep 25 '19 at 12:23