
Common sense suggests that a computation should get faster the more cores or threads we use; with bad scaling, the computation time simply fails to improve as the number of threads grows. So why does increasing the number of threads considerably *increase* the computation time when fitting a GAM with the R package mgcv, as in this example?

library(mgcv)  # provides gam() and tw()
library(boot)  # loads data "amis"

t1 <- Sys.time()

mod <- gam(speed ~ s(period, warning, pair, k = 12), data = amis,
           family = tw(link = log), method = "REML",
           control = list(nthreads = 1))

t2<-Sys.time()

print("Model fitted in:")
print(t2-t1)

If you increase the number of threads in this example to 2, 4, etc., the fitting procedure takes longer and longer, instead of getting faster as one would expect. In my particular case:

1 thread: 32.85333 secs

2 threads: 50.63166 secs

3 threads: 1.2635 mins

Why is this? If I am doing something wrong, what can I do to obtain the desired behavior (i.e., increasing performance with increasing number of threads)?
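The comparison above can be reproduced with a short loop over thread counts (a sketch, assuming the mgcv and boot packages are installed; it uses system.time() rather than Sys.time() differences, and exact timings will of course differ by machine):

```r
library(mgcv)  # gam(), tw()
library(boot)  # the "amis" data set

# Fit the same model with 1, 2, and 3 threads and report elapsed time
for (nt in 1:3) {
  elapsed <- system.time(
    gam(speed ~ s(period, warning, pair, k = 12), data = amis,
        family = tw(link = log), method = "REML",
        control = list(nthreads = nt))
  )["elapsed"]
  cat(nt, "thread(s):", elapsed, "secs\n")
}
```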

Some notes:

1) The model, family, and fitting method shown here make no particular sense; this is only an example. However, I ran into this problem with real data and a reasonable model (for simplicity I use this small example to illustrate it). The data, the functional form of the model, the family, and the fitting method all seem irrelevant: after many tests I always get the same behaviour, i.e., increasing the number of threads decreases performance (increases computation time).

2) Operating system: Linux Ubuntu 18.04;

3) Architecture: Dell PowerEdge with two physical Intel Xeon X5660 CPUs, each with 6 cores @ 2800 MHz, each core able to handle 2 threads (i.e., a total of 24 hardware threads). 80 GB RAM.

4) The OpenMP libraries (needed for the multi-threading capability of function gam) were installed with

sudo apt-get install libomp-dev

5) I am aware of the help page on multi-core use of gam (https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/mgcv-parallel.html). The only thing written there that points to a decrease in performance with an increasing number of threads is: "Because the computational burden in mgcv is all in the linear algebra, then parallel computation may provide reduced (...) benefit with a tuned BLAS".
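Since note 5 suggests the linear-algebra library matters, it can help to check which BLAS an R session is actually linked against and how many cores it sees (a sketch; sessionInfo() reports the BLAS/LAPACK paths in R >= 3.4):

```r
# Report the BLAS/LAPACK shared libraries this R session is using
si <- sessionInfo()
cat("BLAS:  ", si$BLAS, "\n")
cat("LAPACK:", si$LAPACK, "\n")

# Hardware threads vs physical cores (with hyperthreading these differ,
# e.g. 24 vs 12 on the machine described in note 3)
cat("Logical cores: ", parallel::detectCores(logical = TRUE), "\n")
cat("Physical cores:", parallel::detectCores(logical = FALSE), "\n")
```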

nukimov
  • For this model, I'm not seeing much advantage at all from running with multiple threads. Giving the model 4 threads barely pushed the CPU usage for the main R thread above 120% on my system. Given that there is some set-up cost to fork and recombine everything, I doubt this is going to benefit from parallelization much (other models with more smoothers etc do see more benefit). I don't see the increase in compute time though; this seems to take about 24 seconds to fit on my 8 core Xeon with 1, 2, or 4 threads. (I don't have hyperthreading enabled.) – Gavin Simpson Jan 23 '19 at 19:29
  • Thanks for the comment, Gavin! As I wrote (note 1), the model and data are used here only as an example of the problem; I thought it would be better if the example runs in a few seconds or minutes rather than taking hours for anybody to reproduce. My real problem involves thousands of data points, the model is far more complex and has to be fitted roughly 100 times, so I really need the parallel power, otherwise I will be computing for one week instead of one day. Good to know that in your case the problem does not arise. I will check the issue you mention about hyperthreading; perhaps that is the solution. – nukimov Jan 24 '19 at 11:46
  • Sorry I wasn't clear; I have successfully used multiple threads to speed up GAMs with my set-up, and I can see **mgcv** using, say, all 6 cores that I give it. In this example, it looks like whatever computations are threaded (mgcv only uses multiple threads for some aspects of the fit) are not being exercised much. Hence I would expect some increase in compute time, because you pay the cost of forking and then recombining for next to no computational gain. That *is* what I see on my system. I wouldn't expect the slowdown you see, though. Can you try without hyperthreading? – Gavin Simpson Jan 24 '19 at 15:23
  • Also, what happens if you try fitting the model with `bam()`, which is designed to exploit parallel computation in different additional ways? – Gavin Simpson Jan 24 '19 at 15:24
  • Thanks again, I understand the issue better now. I will try to disable hyperthreading as soon as possible; this might take some days. I had also already tried bam(), but it turned out to be incompatible with the distribution I need (Tweedie). – nukimov Jan 24 '19 at 20:45
  • `bam()` should work with the `tw()` family. – Gavin Simpson Jan 25 '19 at 14:35
  • Then I will try it again, for some reason it didn't work before. – nukimov Jan 25 '19 at 16:35
  • Disabling hyperthreading did not change the behavior of gam(): I still get longer computation times with a larger number of CPUs. Function bam() does work with this simple example, but unfortunately with my "real" case I get an error (it seems to be a bug in bam(), but that is another story). With this simple example, bam() behaves as Gavin described: a larger number of CPUs does not decrease the computation time, but at least it does not increase it, as happens with gam(). Summarizing, I'm still puzzled about where my problem lies; I'll keep searching... – nukimov Jan 28 '19 at 11:00
  • I take back something I wrote in the question, namely that the help page on multi-core use of gam contains nothing pointing to a performance decrease. There is indeed a hint: "Because the computational burden in mgcv is all in the linear algebra, then parallel computation may provide reduced (...) benefit with a tuned BLAS". I found that the chosen linear-algebra library plays a strong role in my problem. To switch between libraries you can type in the terminal: sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu . OpenBLAS (manual mode) gave the expected behavior! – nukimov Jan 28 '19 at 12:09
  • I think we can summarize an answer to this question like this: A) disable hyperthreading (from the BIOS, for instance); B) use bam() instead of gam(); C) use the reference BLAS as the linear-algebra library instead of the default OpenBLAS (in my previous comment, which I can no longer edit, I wrote the opposite; sorry for the mistake); D) Gavin's comment about gam() not really multi-threading much of its computational work should be taken into account (this is also mentioned in the help page). Gavin, because you gave 3 of these 4 hints, I think it is fair if you formulate the answer; I will accept it as correct! – nukimov Jan 28 '19 at 12:23
