Update: I'm spawning multiple processes now and it works fairly well, though the basic threading problem still exists.
I'm trying to thread a C++ (g++ 4.6.1) program that compiles a bunch of OpenCL kernels. Most of the run time is spent inside clBuildProgram. (It's genetic programming, and actually running the code and evaluating fitness is much, much faster.) I'm trying to thread the compilation of these kernels and not having any luck so far. At this point there's no shared data between threads (aside from having the same platform and device reference), but it will only run one thread at a time. I can run this code as several processes (just launching them in different terminal windows in Linux) and it will then use multiple cores, but not within one process. I can use multiple cores with the same basic threading code (std::thread) doing just basic math, so I think it has something to do with either the OpenCL compile or some static data I forgot about. :) Any ideas? I've done my best to make this thread-safe, so I'm stumped.
I'm using AMD's SDK (OpenCL 1.1, circa 6/13/2010) and a 5830 or 5850 to run it. The SDK and g++ are not as up to date as they could be. The last time I installed a newer Linux distro in order to get a newer g++, my code ran at half speed (at least the OpenCL compiles did), so I went back. (I just checked the code on that install and it still runs at half speed, with no threading differences.) Also, when I said it only runs one thread at a time: it will launch all of them and then alternate between two until they finish, then do the next two, etc. It does look like all of the threads are running until the code gets to building the program. I'm not using a callback function in clBuildProgram. I realize there's a lot that could be going wrong here and it's hard to say without the code. :)
I am pretty sure this problem occurs inside clBuildProgram or in the call to it. I'm printing the time taken there, and the threads that get postponed come back with a long compile time for their first compile. The only shared data between these clBuildProgram calls is the device ID, in that each thread's cl_device_id has the same value.
This is how I'm launching the threads:
for (a = 0; a < num_threads; a++) {
    threads[a] = std::thread(std::ref(programs[a]));
    threads[a].detach();
    sleep(1); // give the OpenCL init functions time to complete
}
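For anyone trying to reproduce the serialization, here is a minimal sketch of a launcher that keeps the threads joinable and records how long each one spends in its work. It is not my actual code: `run_and_time` and the `work` callback are hypothetical names, and `work` stands in for whatever ends up calling clBuildProgram. If the builds really overlapped, I'd expect each per-thread time to stay close to the single-thread time; if something serializes them, the later threads should report longer times. (Older libstdc++ releases may spell `steady_clock` as `monotonic_clock`.)

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Runs work(i) on num_threads threads at once and returns how long each call
// took, in milliseconds. `work` is a hypothetical stand-in for the per-thread
// functor that ends up calling clBuildProgram.
std::vector<double> run_and_time(int num_threads,
                                 const std::function<void(int)>& work) {
    assert(num_threads > 0);
    std::vector<double> elapsed_ms(num_threads, 0.0);
    std::vector<std::thread> threads;
    threads.reserve(num_threads);

    for (int i = 0; i < num_threads; ++i) {
        // Each thread writes only to its own slot, so no locking is needed.
        threads.emplace_back([i, &work, &elapsed_ms] {
            const auto start = std::chrono::steady_clock::now();
            work(i);
            const auto stop = std::chrono::steady_clock::now();
            elapsed_ms[i] =
                std::chrono::duration<double, std::milli>(stop - start).count();
        });
    }

    // Joining (rather than detach() plus sleep()) guarantees every thread has
    // finished before the timings are read.
    for (std::thread& t : threads) {
        t.join();
    }
    return elapsed_ms;
}
```

Calling it with 1 thread and then with 4 and comparing the returned times should show whether the work is being serialized. The same comparison with a pure math workload gives a baseline for what real parallelism looks like on the machine.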
This is where it's bogging down (and these are all local variables being passed, though the device id will be the same):
clBuildProgram(program, 1, &device, options, NULL, NULL);
It doesn't seem to make a difference whether each thread has a unique context or command_queue. I really suspected that was the problem, which is why I mention it. :)
Update: Spawning child processes with fork() will work for this.
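A rough sketch of the fork() approach, under some assumptions: `run_in_children` and `work` are hypothetical names, `work(i)` stands in for "initialize OpenCL, build the kernels, evaluate", and each child is assumed to do its own OpenCL initialization after the fork. The only thing that comes back to the parent here is the exit status; real results would have to go through a pipe, a file, or shared memory.

```cpp
#include <cassert>
#include <functional>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Forks num_children processes, runs work(i) in each, and returns how many of
// them exited with status 0. `work` is a hypothetical stand-in for the
// per-process OpenCL setup and kernel builds.
int run_in_children(int num_children, const std::function<int(int)>& work) {
    assert(num_children > 0);
    std::vector<pid_t> pids;

    for (int i = 0; i < num_children; ++i) {
        const pid_t pid = fork();
        if (pid == 0) {
            // Child: do the work, then leave without running the parent's
            // atexit handlers or flushing its stdio buffers a second time.
            _exit(work(i));
        }
        if (pid > 0) {
            pids.push_back(pid);
        }
        // pid < 0 means fork() failed; that child is simply not counted.
    }

    int succeeded = 0;
    for (const pid_t pid : pids) {
        int status = 0;
        if (waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            ++succeeded;
        }
    }
    return succeeded;
}
```

Since every child has its own address space, whatever is being shared inside one process (driver state, static data) is no longer shared, which matches what I see when launching separate processes from different terminals.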