
I'm trying to use OpenMP to parallelize some code. On my regular workstation, I got it working with 4 cores (the max I had): all of my cores were at 100% during processing, and it ran fast. However, I've now moved my code to a server to process larger amounts of data, and I updated the number of cores to 30 (the server has 32 cores).

However, when I start the process, it seems to start the 30 threads properly, but most of my cores sit at 0%, except for ~5 of them at around 20%, and the processing is slow. 'top' reports 145 %CPU for my process, when with 30 cores it should be closer to 3000%.

Here is my parallelized code:

#pragma omp parallel for num_threads(N_CORES) schedule(static,1) shared(viImgCounts, viLabelCounts, vfLabelLogLikelihood, ppfMatchingVotes)
for (int n = 0; n < N_CORES; ++n)
  {
    // Each thread processes its own range of files
    for (int i = vpChunkStartEnd[n].first; i <= vpChunkStartEnd[n].second; i++)
      {
        // size() returns size_t, so cast it to match the %d format
        printf("Searching %d of %d ... ", i, (int)vvInputFeats.size());
        ProcessingData(i, viImgCounts, viLabelCounts, vfLabelLogLikelihood, ppfMatchingVotes, ppiLabelVotes);
        printf("\n");
      }
  }

Here I defined N_CORES as 30 (#define N_CORES 30).

Each thread has a range of files to process (vpChunkStartEnd), so to my understanding, they should all run together at 100%, each processing its own list of files. I don't understand why it is not happening here, while it was on my workstation with 4 cores.
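For what it's worth, I understand the same distribution could be written without the manual chunk table; a minimal sketch (same containers and ProcessingData signature as above), letting OpenMP split the file index range itself:

// Sketch: parallelize over the files directly; schedule(static) hands each
// thread one contiguous block, so no vpChunkStartEnd is needed.
#pragma omp parallel for num_threads(N_CORES) schedule(static) shared(viImgCounts, viLabelCounts, vfLabelLogLikelihood, ppfMatchingVotes)
for (int i = 0; i < (int)vvInputFeats.size(); ++i)
  {
    ProcessingData(i, viImgCounts, viLabelCounts, vfLabelLogLikelihood, ppfMatchingVotes, ppiLabelVotes);
  }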

Did I forget something in the #pragma?

Thank you.

Edit: Here is the output I get:

Searching 6750 of 7482 ... Searching 4000 of 7482 ... Searching 3750 of 7482 ... Searching 5000 of 7482 ... Searching 0 of 7482 ... Searching 500 of 7482 ... Searching 2250 of 7482 ... Searching 5500 of 7482 ... |0|Searching 5750 of 7482 ... |0|Searching 6000 of 7482 ... |0|Searching 7250 of 7482 ... |0|Searching 7000 of 7482 ... Searching 750 of 7482 ... Searching 3250 of 7482 ... Searching 1000 of 7482 ... |1||1|Searching 1500 of 7482 ... Searching 2750 of 7482 ... Searching 2000 of 7482 ... |1|Searching 5250 of 7482 ... |0||0|Searching 3500 of 7482 ... Searching 4500 of 7482 ... |0|Searching 1250 of 7482 ... |1|Searching 4750 of 7482 ... |0||1|Searching 6250 of 7482 ... |1|Searching 250 of 7482 ... |1|Searching 4250 of 7482 ... |0|Searching 1750 of 7482 ... |0||0||0||1||0||1||1|Searching 3000 of 7482 ... |0||0||1||0|Searching 2500 of 7482 ... |0|Searching 6500 of 7482 ... |1||0|
Searching 4001 of 7482 ... |0|
Searching 3001 of 7482 ... |0|
Searching 3251 of 7482 ... |0|
Searching 6751 of 7482 ... |1|
Searching 2251 of 7482 ... |0|
Searching 1751 of 7482 ... |0|
Searching 1501 of 7482 ... |0|
Searching 5501 of 7482 ... |1|
Searching 5001 of 7482 ... |1|
Searching 7001 of 7482 ... |1|
Searching 1251 of 7482 ... |0|
Searching 5251 of 7482 ... |1|
Searching 4501 of 7482 ... |1|
Searching 3751 of 7482 ... |0|
Searching 5751 of 7482 ... |1|
Searching 2751 of 7482 ... |0|
Searching 4002 of 7482 ... |0|
Searching 4751 of 7482 ... |1|
Searching 6251 of 7482 ... |1|
Searching 6501 of 7482 ... |1|
Searching 6001 of 7482 ... |1|
Searching 2001 of 7482 ... |0|
Searching 3501 of 7482 ... |0|
Searching 3252 of 7482 ... |0|

We can see that on the first iteration the text is all scrambled because multiple threads run in parallel, but after that everything is printed cleanly and most cores go idle, so it feels like each iteration waits for the previous one to finish before starting. It could just be coincidence that the threads don't finish their loops at the same time, but this keeps going for thousands of files, so it would be a huge coincidence; the fact that most cores are idle also makes me think the later iterations are not properly parallelized.

Edit 2: Okay, something strange happened. As mentioned in the comments, 'top' was telling me I had almost no free memory left (barely 800MB), but the resource manager was only showing 50% used (out of 48GB). I started a program that uses about 20GB, then killed it. Now 'top' shows 20GB of free memory (the used memory didn't move), and my program is running at 1000%. So I wonder if maybe top was actually right about the free memory, and that's what was slowing down my program; but I also wonder why starting a memory-hungry program and killing it freed that memory. Is it possible that the memory was gone due to a memory leak, that the new program's allocation caused the leaked memory to be reused, and that killing it returned the memory to the 'free' pool, making it available to the system?

Edit 3: Okay, so it seems that wasn't the problem. The CPU usage is back to 350% (and most cores went idle again) even though top says I have 20GB of free memory. The CPU usage is so uneven, I don't get it.

Is it possible that because I marked some variables 'shared', threads had to wait for other threads to finish with a variable before accessing it (like a mutex)? I made sure that no two threads write to the same location of these variables. Could this be why my code slowed down?
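From what I can tell, 'shared' only controls which variables the threads see; it inserts no locking. A hypothetical sketch (the function and names are made up for illustration) where disjoint writes to a shared vector get no synchronization at all:

#include <omp.h>
#include <vector>

// Hypothetical illustration: 'shared' adds no implicit mutex. Each thread
// writes to its own index, so the writes are race-free by construction and
// OpenMP inserts no waiting. (Slowdowns can still come from false sharing
// when threads write to adjacent elements of the same cache line.)
// 'results' must have at least nThreads elements.
void run(int nThreads, std::vector<int>& results)
{
    #pragma omp parallel num_threads(nThreads) shared(results)
    {
        int tid = omp_get_thread_num();
        results[tid] = tid * tid;  // stand-in for real per-thread work
    }
}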

Edit 4: Ok, now that I got some memory back it seems to use all my cores, but they're all at about 30% (a few at 50%), nothing crazy, and this time I have plenty of free memory. Should I increase the number of threads? Or would the extra threads just be queued (which wouldn't improve the CPU usage)?

Edit 5: I used htop to get some stats, and I see an average of 10 running tasks at a time, which corresponds to the ~30% CPU usage given the 30 threads I started. So not all my threads are running all the time, but I don't understand why. I might need to profile, but I tested with valgrind and it is way too slow to use (I need to load a bunch of data first, and that is not doable under valgrind).

Edit 6: So I wrote a dummy program using the same omp parameters and the same structure as my code, but instead of calling the ProcessingData function I just increment a variable; it created about 30 running tasks and I got to 100% CPU usage. However, with my real function it creates between 5 and 15 running tasks, and I never get to 100%. I used htop to look at the status of my threads, and they spend lots of time in 'D' status (uninterruptible sleep). I guess I will need to profile them to see why.
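For reference, the dummy program was along these lines (a reconstructed sketch; the chunk setup is assumed, since I only kept the structure of the real code):

#include <cstdio>
#include <utility>
#include <vector>

#define N_CORES 30

int main()
{
    const int nFiles = 7482;
    // Same manual chunking as the real code (assumed reconstruction).
    std::vector<std::pair<int, int> > vpChunkStartEnd(N_CORES);
    for (int n = 0; n < N_CORES; ++n)
        vpChunkStartEnd[n] = std::make_pair(n * nFiles / N_CORES,
                                            (n + 1) * nFiles / N_CORES - 1);

    #pragma omp parallel for num_threads(N_CORES) schedule(static,1)
    for (int n = 0; n < N_CORES; ++n)
      {
        volatile long long local = 0;  // per-thread; volatile keeps the loop alive
        for (int i = vpChunkStartEnd[n].first; i <= vpChunkStartEnd[n].second; i++)
            for (int k = 0; k < 1000000; ++k)
                local = local + 1;  // pure CPU work: all 30 cores reach ~100%
        printf("thread %d done: %lld\n", n, (long long)local);
      }
    return 0;
}

With a pure compute body this saturates all the cores, so the stalls must come from whatever ProcessingData does internally (allocation, I/O, page faults) rather than from the #pragma itself.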

whiteShadow
  • Are you maybe memory bound or similar? If you are talking about files, maybe your server disk is unable to serve all threads their files? You could add some debug output (e.g. involving `omp_get_thread_num()` and the file about to be processed) to see where things are hanging. – Max Langhof Jul 31 '18 at 16:55
  • No, I'm not even at 50% memory usage, so I don't think it's that. Yes, I'm trying to use omp_get_thread_num() right now to see what is going on. Although I'm confused about the memory usage, because 'top' says I have 48GB RAM total and almost nothing free, plus 24GB swap (all free), but the resource manager tells me I have 48GB and 50% free, with also 24GB swap (all free), so I don't really understand who is showing what, and who is right. – whiteShadow Jul 31 '18 at 16:57
  • Though I think top is wrong, because even though it says I don't have any free memory left, my program keeps using more and more, the swap doesn't increase, and the free memory doesn't go to 0, so I guess I have some free memory somewhere that just doesn't show up in top's 'free' column. – whiteShadow Jul 31 '18 at 17:11
  • Unfortunately we cannot really help you without much more information about your application - basically a [mcve]. Otherwise there is just too much guessing. I would recommend using an OpenMP-aware performance analysis tool. Preferably a trace-based one that allows you to investigate the dynamics, e.g. [Score-P](http://score-p.org) and [Vampir](https://vampir.eu). – Zulan Jul 31 '18 at 19:56
  • @whiteShadow "Memory bound" does not mean "running out of memory" but "having your processing speed capped by how fast your system can transfer memory contents (e.g. from RAM to CPU). Similarly, a slow disk may not be able to supply file contents to tens of threads at the same time, so some/many of them may wait idly for file contents to arrive. – Max Langhof Aug 01 '18 at 07:24
  • Ok thanks. Indeed, that's maybe what is happening here. – whiteShadow Aug 01 '18 at 12:20
  • @MaxLanghof How could I check that? Is there a way to see if this is due to being memory bound? – whiteShadow Aug 01 '18 at 18:09
  • Watching the disk activity/usage while your program is running would be a start. For being memory bound (less likely to be the reason than the hard drive read speed) you would probably need a separate tool. I know Intel Amplifier can help, but other than that I would google just like you. – Max Langhof Aug 02 '18 at 07:32
  • Thanks, I found atop, a great tool. Though I use 0% disk (everything is loaded into memory and runs from there afterwards). I found that my CPUs are ~2000% idle, but 0% wait. I didn't see any bottleneck from that. I'm trying to check the memory bandwidth now. – whiteShadow Aug 02 '18 at 21:14
  • I performed strace on my process, and it's doing tons of mmap/munmap calls. Could this be an issue? – whiteShadow Aug 02 '18 at 21:33
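A note on that last observation: frequent mmap/munmap traffic is often glibc malloc serving and returning large allocations, and munmap from many threads serializes on the kernel's per-process memory-map lock, which shows up as 'D' state. One way to test that hypothesis (a hedged sketch; glibc-specific, and the 64 MiB value is illustrative) is to raise the threshold above which malloc uses mmap, so large blocks are reused from the heap instead of being unmapped:

#include <malloc.h>  // glibc-specific mallopt

int main()
{
    // Serve allocations up to 64 MiB from the heap instead of mmap,
    // so frees stop turning into munmap syscalls (illustrative value).
    mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024);
    // ... load data and run the parallel search as before ...
    return 0;
}

The same experiment can be run without recompiling by setting the MALLOC_MMAP_THRESHOLD_ environment variable before launching the program.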

1 Answer


We can't see the rest of your code, but you may want to try employing the following functions to figure out what's going on:

omp_get_num_threads() // get the number of threads operating in a parallel region    

And:

omp_set_num_threads(numThreads) // set the number of threads used in a parallel region

That'll give you a better idea of processor utilization than just looking at a processor activity monitor.
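Here's a minimal sketch of those calls in context (note the gotcha: omp_get_num_threads() returns 1 when called outside a parallel region):

#include <omp.h>
#include <cstdio>

int main()
{
    omp_set_num_threads(30);  // request 30 threads for subsequent regions
    printf("outside: %d\n", omp_get_num_threads());  // prints 1: no team here
    #pragma omp parallel
    {
        #pragma omp single
        printf("inside: %d threads, I am %d\n",
               omp_get_num_threads(), omp_get_thread_num());
    }
    return 0;
}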

Here are some other great answers that describe this in more detail:

Finally, it's important to remember that cores are not the same thing as threads. You probably know that, but just wanted to keep the terminology straight.

Alex Johnson
  • Thank you. I thought the num_threads(n) clause was equivalent to omp_set_num_threads(n) (basically setting OMP_NUM_THREADS to n). Is there a real difference? Yes, I know threads are different from cores, but I'd basically like 1 thread per core, as they are doing independent work. – whiteShadow Jul 31 '18 at 16:42
  • @whiteShadow They're not the same. Check out that second link above. – Alex Johnson Jul 31 '18 at 16:45
  • Yes, I've seen this post, but for me "the presence of the num_threads clause overrides both other values." meant that they were equivalent (or at least that the num_threads clause would set OMP_NUM_THREADS the way omp_set_num_threads does). Am I wrong? – whiteShadow Jul 31 '18 at 16:50
  • @AlexJohnson They are equivalent when used with the same value (before the parallel region, of course), according to your own link. – Max Langhof Jul 31 '18 at 16:51
  • @AlexJohnson Also, I just want to mention that this is all I have related to OMP in my code, nothing else; the whole code is very long, so I only posted the relevant part. – whiteShadow Jul 31 '18 at 16:56
  • @whiteShadow Sorry, should've been more clear. I suppose you could say that they can both be used to the same ends. What do you get with calls to `omp_get_num_threads()`? – Alex Johnson Jul 31 '18 at 16:59
  • I tried adding omp_set_num_threads(30), and omp_get_num_threads() returns 30, but still, after the first iteration of each thread, most cores go idle. And even during the first iteration, cores are at 20%. – whiteShadow Jul 31 '18 at 17:05
  • @whiteShadow Hmmm. You probably know this based on other research, but that sets an upper limit rather than explicitly setting the number of threads OpenMP will use. – Alex Johnson Jul 31 '18 at 17:09
  • But if I make a loop of N iterations, and set schedule(static,1) with N threads, it should dispatch each iteration to a new thread, right? – whiteShadow Jul 31 '18 at 17:12
  • Also, on the first iteration (as I have a printf in the parallelized region) the text is all a bit scrambled (normal), but after that it isn't anymore (it may be a coincidence that threads don't finish their loops at the same time); but given that they don't use multiple cores, I feel like the first iteration is parallelized but not the following ones, as if each thread waits for the previous one to finish before starting. – whiteShadow Jul 31 '18 at 17:17
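For the record, that reading of schedule(static,1) is correct: chunks of one iteration are handed out round-robin, so N iterations across N threads means exactly one per thread. A minimal sketch to verify (hypothetical, with 4 threads):

#include <omp.h>
#include <cstdio>

int main()
{
    // With schedule(static,1), iteration i runs on thread i % num_threads,
    // so each of the 4 threads below gets exactly one iteration.
    #pragma omp parallel for num_threads(4) schedule(static,1)
    for (int i = 0; i < 4; ++i)
        printf("iteration %d on thread %d\n", i, omp_get_thread_num());
    return 0;
}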