I'm trying to use OpenMP to parallelize some code. On my regular workstation, I got it working with 4 cores (the maximum I had): all of my cores were at 100% during processing, and it ran fast. However, I've now moved my code to a server to process a larger amount of data, and I updated the number of cores to 30 (the server has 32 cores).
When I start the process, it does seem to start the 30 threads properly, but most of my cores sit at 0%, except for ~5 of them at around 20%, and the whole run takes a long time. 'top' reports 145 %CPU for my process, whereas with 30 cores it should be closer to 3000%.
Here is my parallelized code:
#pragma omp parallel for num_threads(N_CORES) schedule(static,1) shared(viImgCounts, viLabelCounts, vfLabelLogLikelihood, ppfMatchingVotes)
for (int n = 0; n < N_CORES; ++n)
{
    // Each thread processes its own range of files
    for (int i = vpChunkStartEnd[n].first; i <= vpChunkStartEnd[n].second; i++)
    {
        printf("Searching %d of %d ... ", i, (int)vvInputFeats.size());
        ProcessingData(i, viImgCounts, viLabelCounts, vfLabelLogLikelihood, ppfMatchingVotes, ppiLabelVotes);
        printf("\n");
    }
}
Here I defined N_CORES as 30 (#define N_CORES 30).
Each thread has a range of files to process (vpChunkStartEnd), so in my understanding they should all run together at 100%, each working through its own list of files. I don't understand why that isn't happening here, while it did on my workstation with 4 cores.
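(For comparison, my understanding is that letting OpenMP split the file range by itself should be equivalent to my manual chunking; this is just a sketch of that variant, not what I'm currently running:)

#pragma omp parallel for num_threads(N_CORES) schedule(static) shared(viImgCounts, viLabelCounts, vfLabelLogLikelihood, ppfMatchingVotes)
for (int i = 0; i < (int)vvInputFeats.size(); ++i)
{
    // schedule(static) hands each thread a contiguous block of file
    // indices, so vpChunkStartEnd is not needed in this variant.
    ProcessingData(i, viImgCounts, viLabelCounts, vfLabelLogLikelihood, ppfMatchingVotes, ppiLabelVotes);
}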
Did I forget something in the #pragma?
Thank you.
Edit: Here is the output I get:
Searching 6750 of 7482 ... Searching 4000 of 7482 ... Searching 3750 of 7482 ... Searching 5000 of 7482 ... Searching 0 of 7482 ... Searching 500 of 7482 ... Searching 2250 of 7482 ... Searching 5500 of 7482 ... |0|Searching 5750 of 7482 ... |0|Searching 6000 of 7482 ... |0|Searching 7250 of 7482 ... |0|Searching 7000 of 7482 ... Searching 750 of 7482 ... Searching 3250 of 7482 ... Searching 1000 of 7482 ... |1||1|Searching 1500 of 7482 ... Searching 2750 of 7482 ... Searching 2000 of 7482 ... |1|Searching 5250 of 7482 ... |0||0|Searching 3500 of 7482 ... Searching 4500 of 7482 ... |0|Searching 1250 of 7482 ... |1|Searching 4750 of 7482 ... |0||1|Searching 6250 of 7482 ... |1|Searching 250 of 7482 ... |1|Searching 4250 of 7482 ... |0|Searching 1750 of 7482 ... |0||0||0||1||0||1||1|Searching 3000 of 7482 ... |0||0||1||0|Searching 2500 of 7482 ... |0|Searching 6500 of 7482 ... |1||0|
Searching 4001 of 7482 ... |0|
Searching 3001 of 7482 ... |0|
Searching 3251 of 7482 ... |0|
Searching 6751 of 7482 ... |1|
Searching 2251 of 7482 ... |0|
Searching 1751 of 7482 ... |0|
Searching 1501 of 7482 ... |0|
Searching 5501 of 7482 ... |1|
Searching 5001 of 7482 ... |1|
Searching 7001 of 7482 ... |1|
Searching 1251 of 7482 ... |0|
Searching 5251 of 7482 ... |1|
Searching 4501 of 7482 ... |1|
Searching 3751 of 7482 ... |0|
Searching 5751 of 7482 ... |1|
Searching 2751 of 7482 ... |0|
Searching 4002 of 7482 ... |0|
Searching 4751 of 7482 ... |1|
Searching 6251 of 7482 ... |1|
Searching 6501 of 7482 ... |1|
Searching 6001 of 7482 ... |1|
Searching 2001 of 7482 ... |0|
Searching 3501 of 7482 ... |0|
Searching 3252 of 7482 ... |0|
We can see that on the first iteration the text is all scrambled because multiple threads run in parallel, but after that everything is printed cleanly and most cores go idle, so it feels like each iteration waits for the previous one to finish before starting. It could just be a coincidence that the threads never finish their iterations at the same time, but this keeps going for thousands of files, so it would be a huge coincidence; that, plus the fact that most cores are idle, makes me think the later iterations are not properly parallelized.
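To check whether the iterations really serialize, I could tag every printed line with the thread id and a timestamp; here is a minimal sketch of that instrumentation (omp_get_thread_num() and omp_get_wtime() are standard OpenMP calls; this is not the code that produced the log above):

#include <omp.h>
#include <cstdio>

#define N_CORES 30

int main()
{
    const int nFiles = 7482;           // same number of files as in my run
    double t0 = omp_get_wtime();

    // Tag every message with the thread id and elapsed wall time, so the
    // log shows which threads are actually active and when.
    #pragma omp parallel for num_threads(N_CORES) schedule(static,1)
    for (int i = 0; i < nFiles; ++i)
    {
        printf("[thread %2d | %7.2fs] Searching %d of %d\n",
               omp_get_thread_num(), omp_get_wtime() - t0, i, nFiles);
        // ProcessingData(...) would go here
    }
    return 0;
}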
Edit 2: Okay, something strange happened. As mentioned in the comments, 'top' was telling me I had almost no free memory left (barely 800MB), but the resource manager was only showing 50% used (out of 48GB). I started a program that uses about 20GB, then killed it. Now 'top' shows 20GB of free memory (the amount of used memory didn't change), and my program is running at 1000%. So I wonder whether top was actually right about the free memory, and that was what was slowing down my program, but I also wonder why starting a memory-hungry program and then killing it would free that memory. Is it possible that the memory was gone due to a memory leak, that the new program requested memory so the leaked pages were reclaimed and reused, and that when it was killed, that memory was returned to the 'free' pool and became available to the system?
Edit 3: Okay, so it seems that wasn't the problem. The CPU usage is back down to 350% (and most cores went idle again), even though top now says I have 20GB of free memory. The CPU usage is so uneven; I don't get it.
Is it possible that, because I used 'shared' on some variables, threads have to wait for other threads to finish with a variable before accessing it (like a mutex)? I made sure that no two threads ever write to the same location of these variables. Could this be why my code slowed down?
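From what I've read, 'shared' by itself doesn't add any locking, but I wonder about false sharing: even if each thread writes only to its own element, neighbouring elements of the same array can sit on the same cache line and the threads keep invalidating it for each other. Here is a sketch of what I mean, with padding as one possible workaround (PaddedSlot is a made-up name, not something from my code):

#include <vector>

// Each thread writes only to its own slot, but without padding two
// neighbouring slots would share a 64-byte cache line (false sharing).
// Padding every slot to a full cache line gives each thread its own line.
struct PaddedSlot
{
    long count;
    char pad[64 - sizeof(long)];   // pad to a typical cache-line size
};

int main()
{
    std::vector<PaddedSlot> slots(30);

    #pragma omp parallel for num_threads(30) schedule(static,1)
    for (int n = 0; n < 30; ++n)
        for (long k = 0; k < 100000000L; ++k)
            slots[n].count++;      // each thread touches only its own line

    return 0;
}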
Edit 4: OK, now that I got some memory back it does seem to use all my cores, but they're all around 30% (a few of them at 50%), and this time I have plenty of free memory, so usage still isn't great. Should I increase the number of threads? Or will the extra threads just be queued (which wouldn't improve the CPU usage)?
Edit 5: I used htop to get some stats, and it seems I have an average of 10 running tasks at a time, which corresponds to ~30% CPU usage with the 30 threads I start (which is what I have). So not all of my threads are running all the time, but I don't understand why. I might need to profile it, but I tried valgrind and it is way too slow to be usable (I need to load a bunch of data first, and that's not doable under valgrind).
Edit 6: So I wrote a dummy program using the same OpenMP parameters and the same structure as my code, but instead of calling ProcessingData I just increment a variable; it creates about 30 running tasks and reaches 100% CPU usage. However, with my real function it creates between 5 and 15 running tasks, and I never get to 100%. I used htop to look at the status of my threads, and they spend a lot of time in 'D' status (uninterruptible sleep). I guess I will need to profile them to see why.
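Since 'D' usually means a thread is blocked in the kernel (typically waiting on disk I/O), my plan is to time the file-loading part of ProcessingData separately from the computation. A rough sketch of the timing I intend to add (LoadFile and Compute are hypothetical stand-ins for the two halves of my real function):

#include <omp.h>
#include <cstdio>

// Rough sketch: time the part of the work that touches the disk
// separately from the pure computation, to see where the threads spend
// their time. LoadFile() and Compute() are placeholders, not real code.
void ProcessingDataTimed(int i)
{
    double t0 = omp_get_wtime();
    // LoadFile(i);                 // disk I/O: threads blocked here show up as 'D'
    double t1 = omp_get_wtime();
    // Compute(i);                  // CPU-bound part: threads here show up as 'R'
    double t2 = omp_get_wtime();

    printf("[thread %2d] file %d: load %.3fs, compute %.3fs\n",
           omp_get_thread_num(), i, t1 - t0, t2 - t1);
}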