I have a program structure similarly to this:
ssize_t remain = nsamp;
while (!nsamp || remain > 0) {
#pragma omp parallel for num_threads(nthread)
for (ssize_t ii=0; ii < nthread; ii++) {
<generate noise>
}
// write noise
out.write(data, nthread*PERITER);
remain -= nthread*PERITER;
}
The problem is, when I benchmark the output of this, if I run with eg: two threads, sometimes it takes ~ the same time as a single thread, and sometimes I get a 2x speedup, it feels like there's some sort of synchronization race condition that I'm running into, sometimes I hit it and things go smoothly and sometimes (often) not.
Does anyone know what might be causing this and what the right way to parallelize a section inside of an outer while loop is?
Edit: Using strace, I see a lot of calls to sched_yield() This is probably making it look like I'm doing a lot on the CPU but I'm fighting the scheduler for a good scheduling pattern.