OpenCL programs/kernels get built at runtime using the clBuildProgram() function. My program creates kernels dynamically, so it spends a considerable amount of time compiling them. This person seems to have had a very similar problem, but that was 6 years ago and the solution is not really satisfactory in my opinion.
Since there are many kernels and they are completely independent of each other, I would like to split this work across multiple cores, as shown in the snippet below:
    ThreadPool tempPool;
    auto start = std::chrono::steady_clock::now();
    for (int reps = 0; reps < 50; reps++) {
        tempPool.addJob([this] () {
            auto jobStart = std::chrono::steady_clock::now();
            // These hold the program sources and their lengths
            std::vector<const char*> sources = {sourceCode.toRawUTF8()};
            std::vector<size_t> sourceLengths = {sourceCode.getNumBytesAsUTF8()};
            cl_int ret;
            cl_program program = clCreateProgramWithSource(getCLContext()(), 1, sources.data(), sourceLengths.data(), &ret);
            // Build the program for the first device
            ret = clBuildProgram(program, 1, &getCLDevices()[0](), NULL, NULL, NULL);
            if (ret) {
                // Generic error checking
            }
            // Time spent in this one job (create + build)
            auto singleDuration = std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - jobStart).count();
            // Release the program so repeated builds do not leak
            clReleaseProgram(program);
        });
    }
    // Simple way to wait for all jobs to be finished
    while (tempPool.getNumJobs() > 0) {
        Thread::sleep(1);
    }
    // Wall-clock time for all 50 builds
    auto totalDuration = std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();
Everything else I run through this ThreadPool setup gets a speedup of 5-6x (I have 8 threads), which is to be expected. Building OpenCL kernels, however, does not speed up. It seems as if only one program can be built at a time.
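For reference, here is a minimal sketch of how such a speedup figure could be measured in isolation, using only the standard library. The names timeMs, speedup and measureSpeedup are hypothetical helpers made up for this illustration, and job stands in for whatever work is being parallelised (in my case the create-and-build lambda). It assumes every call to job does the same amount of work.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Wall-clock time of a single call to f, in milliseconds.
double timeMs(const std::function<void()>& f) {
    const auto begin = std::chrono::steady_clock::now();
    f();
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - begin).count();
}

// Sequential time divided by parallel time.
// About 1.0 means the jobs did not overlap; about N means N jobs ran concurrently.
double speedup(double sequentialMs, double parallelMs) {
    return parallelMs > 0.0 ? sequentialMs / parallelMs : 0.0;
}

// Runs `job` numJobs times back to back, then numJobs times on separate
// threads, and returns the ratio of the two wall-clock times.
// `job` is a hypothetical stand-in for the work being measured.
double measureSpeedup(unsigned numJobs, const std::function<void()>& job) {
    const double sequentialMs = timeMs([&] {
        for (unsigned i = 0; i < numJobs; ++i) job();
    });

    const double parallelMs = timeMs([&] {
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < numJobs; ++i) workers.emplace_back(job);
        for (auto& worker : workers) worker.join();
    });

    return speedup(sequentialMs, parallelMs);
}
```

Calling measureSpeedup(8, job) with a CPU-bound job gives a value well above 1 on my machine, while a job that is serialised internally (for example behind a global lock) would be expected to stay close to 1, which is what the kernel builds look like.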
Is there a solution to this? I'm on macOS at the moment, but I would also be interested in Linux/Windows.
If not, is there a way to build OpenCL kernels that does not involve clBuildProgram(), for example with gcc or a similar tool?