
I am struggling with the parallel package. Part of the problem is that I am quite new to parallel computing and I lack a general understanding of what works and what doesn't (and why). So, apologies if what I am about to ask doesn't make sense from the outset or simply can't work in principle (that might well be).

I am trying to optimize a portfolio of securities that consists of individual sub-portfolios. The sub-portfolios are created independently of one another, so this task should be suitable for a parallel approach (the portfolios are combined only at a later stage).

Currently I am using a serial approach: lapply takes care of it and it works just fine. The whole thing is wrapped in a function, though the wrapper doesn't really have a purpose beyond preparing the list over which lapply will iterate, applying FUN.

The (serial) code looks as follows:

assemble_buckets <- function(bucket_categories, ...) {
  optimize_bucket <- function(bucket_category, ...) {}
  SAA_results <- lapply(bucket_categories, FUN = optimize_bucket, ...)
  names(SAA_results) <- bucket_categories
  SAA_results
}

I am testing the performance using a simple loop.

a <- 1000
for (n in 1:a) {
  if (n == 1) {start_time <- Sys.time()}
  x <- assemble_buckets(bucket_categories, ...)
  if (n == a) {print(Sys.time() - start_time)}
}

Time for 1000 replications is ~19.78 mins - not too bad, but I need a quicker approach, because I want to let this run on a growing selection of securities.
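As an aside, R's built-in system.time() can replace the manual start/stop bookkeeping in the loop above. A minimal sketch; assemble_buckets and bucket_categories here are trivial stand-ins, since the real optimizer isn't shown:

```r
# system.time() times a whole expression and reports elapsed wall-clock
# seconds, so no if (n == 1) / if (n == a) bookkeeping is needed.
# assemble_buckets and bucket_categories are stand-ins, not the real code.
assemble_buckets <- function(bucket_categories) {
  lapply(bucket_categories, function(b) sum(rnorm(1e3)))
}
bucket_categories <- c("equity", "bonds", "fx", "commodities")

timing <- system.time({
  for (n in 1:100) x <- assemble_buckets(bucket_categories)
})
print(timing["elapsed"])
```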

So naturally, I'd like to use a parallel approach. The (naïve) parallelized code using parLapply looks as follows (it really is my first attempt…):

assemble_buckets_p <- function(cluster_nr, bucket_categories, ...) {
  f1 <- function(...) {}
  f2 <- function(...) {}
  optimize_bucket_p <- function(bucket_category, ...) {}

  clusterExport(cluster_nr, varlist = list("optimize_bucket_p", "f1", "f2"), envir = environment())
  clusterCall(cluster_nr, function() library(...))

  SAA_results <- parLapply(cluster_nr, bucket_categories, optimize_bucket_p, ...)
  names(SAA_results) <- bucket_categories
  SAA_results
}
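One structural thing worth noting about the function above: clusterExport() and clusterCall() run on every call of assemble_buckets_p, although what they do doesn't change between calls. A hedged sketch of hoisting that one-off setup out of the per-call function (f1, f2 and optimize_bucket_p are toy stand-ins for the real helpers):

```r
library(parallel)

# Toy stand-ins for the real helpers; the point is the structure, not the math.
f1 <- function(x) x + 1
f2 <- function(x) x * 2
optimize_bucket_p <- function(bucket_category) f2(f1(nchar(bucket_category)))

cl <- makeCluster(2)
# One-off setup: export helpers (and load packages via clusterCall) once,
# right after creating the cluster, rather than on every call.
clusterExport(cl, varlist = c("f1", "f2", "optimize_bucket_p"))

assemble_buckets_p <- function(cl, bucket_categories) {
  # Only the cheap per-call work remains inside the function.
  res <- parLapply(cl, bucket_categories, optimize_bucket_p)
  names(res) <- bucket_categories
  res
}

x <- assemble_buckets_p(cl, c("equity", "bonds"))
stopCluster(cl)
```

On a PSOCK cluster every clusterExport call pays a serialization-and-send cost per worker, so moving it out of the hot path can matter even when the exported objects are small.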

f1 and f2 were previously defined inside the optimizer function; they are now outside, because the whole thing runs significantly faster with them separate (it would also be interesting to know why that is).
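A plausible reason, though it can't be verified without the real code: a function defined inside another function captures that function's environment, and when the function is shipped to the workers, the captured environment is serialized along with it. Defining f1 and f2 at the top level gives them the global environment, which is not serialized. A small demonstration of the effect:

```r
# A closure defined inside a function drags its enclosing environment
# along when serialized; a top-level function does not.
make_closure <- function() {
  big <- rnorm(1e6)    # large object trapped in the enclosing environment
  function(x) x + 1    # never uses 'big', but still carries the environment
}
inner <- make_closure()
outer <- function(x) x + 1  # top level: environment is globalenv, not shipped

size_inner <- length(serialize(inner, NULL))
size_outer <- length(serialize(outer, NULL))
size_inner > size_outer  # the inner closure is several MB heavier
```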

I am again testing the performance using a similar loop structure.

cluster_nr <- makeCluster(min(detectCores(), length(bucket_categories)))

b <- 1000
for (n in 1:b) {
  if (n == 1) {start_time <- Sys.time()}
  x <- assemble_buckets_p(cluster_nr, bucket_categories, ...)
  if (n == b) {print(Sys.time() - start_time)}
}

Runtime here is significantly faster, 5.97 mins, so there is some improvement. As the portfolios grow larger, the benefits should increase further, so I conclude parallelization is worthwhile.
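That conclusion is consistent with how parallel overhead behaves: each parLapply call pays a fixed communication cost per worker, so the speedup only materializes when the per-task work dominates that cost. A rough, machine-dependent illustration with a toy workload (timings will vary; only the relative pattern is the point):

```r
library(parallel)

# Toy task heavy enough that parallel dispatch overhead can pay off;
# shrink the inner loop and the serial version will usually win instead.
task <- function(i) { s <- 0; for (j in 1:2e5) s <- s + j; s }

cl <- makeCluster(2)
t_serial   <- system.time(r1 <- lapply(1:8, task))["elapsed"]
t_parallel <- system.time(r2 <- parLapply(cl, 1:8, task))["elapsed"]
stopCluster(cl)

# Both approaches must agree on the result; which is faster depends on
# task size and machine.
print(c(serial = t_serial, parallel = t_parallel))
```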

Now, I am trying to use the parallelized version of the function inside a wrapper. The wrapper function has multiple layers and basically is, at its top level, a loop rebalancing the whole portfolio (multiple asset classes) for a given point in time.

Here comes the problem: when I let this run, something weird happens. Whilst the parallelized version actually does seem to be working (execution doesn't stop), it takes much, much longer than the serial one, like a factor of 100 longer.

In fact, the parallel version takes so much longer that it is way too slow to be of any use. What puzzles me is that - as said above - when I use the optimizer function on a standalone basis, it actually seems to work, and it keeps getting more enigmatic...

I have been trying to further isolate the issue since an earlier version of this question, and I think I've made some progress. I wrapped my optimizer function into a self-sufficient test function, called test_p().

test_p <- function() {
  a <- 1
  for (n in 1:a) {
    if (n == 1) {start_time <- Sys.time()}
    x <- assemble_buckets_p(...)
    if (n == a) {print(Sys.time() - start_time)}
  }
}

test_p() returns its runtime using print(), and I can put it anywhere in the multi-layered wrapper I want. The wrapper structure is as follows:

optimize_SAA <- function(...) {             # [1]
  construct_portfolio <- function(...) {    # [2]
    construct_assetclass <- function(...) { # [3]
      assemble_buckets <- function(...) {   # note: this is where I initially wanted to put the parallel part
}}}}

So now here's the thing: when I add test_p() to the [1] and [2] layers, it works just as if it were standalone. It can't do anything useful there because it's in the wrong place, but it yields a result using multiple CPU cores within 0.636 secs.

As soon as I put it down to the [3] layer or below, executing the very same function takes 40 seconds. I really have tried everything I could think of, but I have no idea why that is.
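For what it's worth, one mechanism that would produce exactly this symptom (an assumption, since the real functions aren't shown): clusterExport(..., envir = environment()) exports from the current frame, and any function defined at layer [3] captures that frame, so each call serializes whatever data construct_assetclass holds locally and ships it to every worker. A hypothetical illustration of how much the captured frame can weigh, and how re-pointing a function's environment avoids it:

```r
# Hypothetical illustration: a function created inside a data-heavy frame
# serializes to megabytes; re-pointing its environment at globalenv makes
# it cheap to ship to workers (at the cost of losing access to that frame).
layer3 <- function() {
  big_local <- rnorm(1e6)           # stands in for data held at layer [3]
  f <- function(x) x + 1
  heavy <- length(serialize(f, NULL))
  environment(f) <- globalenv()     # detach f from the heavy frame
  light <- length(serialize(f, NULL))
  c(heavy = heavy, light = light)
}
sizes <- layer3()
sizes["heavy"] > sizes["light"]  # the captured frame dominated the size
```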

To sum it up, those would be my questions:

  • Does anyone have an idea what the root cause of this problem might be?
  • Why does the runtime of parallel code seem to depend on where the code sits?
  • Is there anything obvious that I could/should try to fix this?

Many thanks in advance!

Allrounder
  • Your question started off well, but it is not clear what the issue is. Are you trying to parallelize a parallelized event? If you can, clarify the last 1/3 of your question. – Dave2e Feb 23 '22 at 02:04
  • Many thanks Dave for looking into this, I really appreciate your time. I have updated the question and I hope it's somewhat clearer and more concise (though it got longer in the process...). I was also able to eliminate the warnings(), they were not related to the issue. – Allrounder Feb 23 '22 at 18:54
  • Without seeing your functions, it is impossible to provide a meaningful answer here. My first thought is that you are trying to parallelize an already parallelized routine and the overhead time is hurting your performance. I am going to vote to close this question. I suggest deleting it and creating a new question providing more details on what your routine is actually doing. – Dave2e Feb 23 '22 at 23:25
  • I understand, but the whole code would be too long. I am not trying to parallelize already parallelized routines, these are (short) for-loops, going over individual asset classes, sectors, etc. The heavy-lifting is being done by the optimizer function, which is the only part that I tried to parallelize. Anyways, I think you have already helped, as it doesn't appear to be a known/common or super obvious issue, so thanks. – Allrounder Feb 24 '22 at 06:04

0 Answers