I'm using the snow package in R to execute a function on a SOCK cluster made up of three Linux machines. I tried running the code with both parLapply and clusterApply.

When an error occurs at the worker level, the workers' results are not returned properly to the master, which makes debugging very hard. I'm currently logging every heartbeat of the worker nodes independently using futile.logger, and according to those logs the results are computed correctly. But when I try to print the result at the master node (after receiving the output from the workers), I get this error: Error in checkForRemoteErrors(val): 8 nodes produced errors; first error: missing value where TRUE/FALSE needed.

Is there any way to debug the results of the workers more deeply?

Steve Weston
  • First order of business would be to run the code (with a reduced number of iterations) without parallelization and debug. Have you done that? – Roland Jun 03 '13 at 12:04
  • Check that some of the workers aren't actually computing NA or NULL as their results. That sort of thing would log fine, but the reduce or aggregate step will fail when it tries to return to the master. The error you are seeing could be something like that. Can you compute sequentially and see the actual result of each batch or chunk? Also check traceback(). – Tommy Levi Jun 03 '13 at 14:28
  • @Roland: Thanks for your comment. Yes, I did that. It works fine without parallelization. Also, if it helps, the workers are managed via passwordless SSH login (using auth keys). I am not able to reproduce this error. – pravinvenugopal Jun 04 '13 at 06:13
  • @TommyLevi: Thanks for your reply. I'm logging the results of the workers too, and they are computed fine. I cannot do a `traceback()`, since the R sessions created for the workers are closed after the job is done. I don't want to keep unwanted sessions alive. – pravinvenugopal Jun 04 '13 at 06:18
  • Is it possible to do a run sequentially on just your local machine? That could rule some things out. Does it fail every time on the workers? Or just some of the time? Also double check that both your base R and any packages being used are the same version numbers. – Tommy Levi Jun 04 '13 at 20:02
  • @TommyLevi: There is no problem in the sequential case. It fails every time, but at random points. Sometimes the code runs smoothly for a while and then throws this error; sometimes it fails earlier. – pravinvenugopal Jun 05 '13 at 14:03
  • Are you deploying from the cluster/server itself? or from a local machine that farms out? – Tommy Levi Jun 05 '13 at 16:53
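
The sequential check suggested in the comments above can be sketched as follows. This is only an illustration: `task` is a hypothetical stand-in for the real worker function, written so that one input quietly yields NA.

```r
# Hypothetical stand-in for the real worker function (an assumption,
# not the asker's code): it silently returns NA for one input.
task <- function(i) {
  if (i == 3) NA else 2 * i
}

# Run the same inputs sequentially, with no cluster involved
results <- lapply(1:4, task)

# Indices of the chunks whose result is NULL or contains an NA
bad <- which(vapply(results,
                    function(x) is.null(x) || anyNA(x),
                    logical(1)))
print(bad)
```

Any index reported in `bad` points at an input worth re-running on its own before going back to the cluster.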

2 Answers

11

The checkForRemoteErrors function is called by parLapply and clusterApply to check for task errors, and it will throw an error if any of the tasks failed. Unfortunately, although it displays the error message, it doesn't provide any information about what worker code caused the error. But if you modify your worker/task function to catch errors, you can retain some extra information that may be helpful in determining where the error occurred.

For example, here's a simple snow program that fails. Note that it uses outfile='' when creating the cluster so that output from the program is displayed, which by itself is a very useful debugging technique:

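library(snow)
# outfile='' sends worker output to the master's console
cl <- makeSOCKcluster(2, outfile='')
problem <- function(i) {
  # 'if (NA)' always fails with
  # "missing value where TRUE/FALSE needed"
  if (NA)
    j <- 999
  else
    j <- i
  2 * j
}
r <- parLapply(cl, 1:2, problem)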

When you execute this, you see the error message from checkForRemoteErrors and some other messages, but nothing that tells you that the if statement caused the error. To catch errors when calling problem, we define workerfun:

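workerfun <- function(i) {
  tryCatch({
    problem(i)
  },
  error=function(e) {
    # Print the condition on the worker (visible thanks to outfile='')
    print(e)
    # Re-raise it so the master still sees the task as failed
    stop(e)
  })
}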

Now we execute workerfun with parLapply instead of problem, first exporting problem to the workers:

clusterExport(cl, c('problem'))
r <- parLapply(cl, 1:2, workerfun)

Among the other messages, we now see

<simpleError in if (NA) j <- 999 else j <- i: missing value where TRUE/FALSE needed>

which includes the actual if statement that generated the error. Of course, it doesn't tell you the file name and line number of the expression, but it's often enough to let you solve the problem.
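
A variation on the same idea, given here only as a sketch, is to return the error as data instead of re-raising it, so the master can inspect every failed task after the run. The helper name `capture_errors` and the fields of the returned list are assumptions made for illustration, not part of snow. The demonstration runs with plain lapply so it needs no cluster.

```r
# Same failing task as in the example above
problem <- function(i) {
  if (NA)
    j <- 999
  else
    j <- i
  2 * j
}

# Hypothetical helper: wraps a task function so that a failure comes
# back as an ordinary list instead of an error.
capture_errors <- function(f) {
  function(...) {
    stack <- NULL
    tryCatch(
      withCallingHandlers(
        list(ok = TRUE, value = f(...)),
        error = function(e) {
          # This handler runs before the stack unwinds, so sys.calls()
          # still contains the calls that led to the failure.
          stack <<- vapply(sys.calls(),
                           function(x) deparse(x)[1],
                           character(1))
        }
      ),
      error = function(e) {
        list(ok = FALSE,
             message = conditionMessage(e),
             call = deparse(conditionCall(e))[1],
             stack = stack)
      }
    )
  }
}

# Local demonstration without a cluster
results <- lapply(1:2, capture_errors(problem))
failed <- which(!vapply(results, function(r) r$ok, logical(1)))
print(failed)
print(results[[1]]$message)
print(results[[1]]$call)
```

On a cluster, the wrapped function could be passed to parLapply in place of workerfun, for example `parLapply(cl, 1:2, capture_errors(problem))`. Since the workers then return ordinary lists, checkForRemoteErrors should not abort the call, and the `ok`, `message`, `call` and `stack` fields can be examined on the master.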

Steve Weston
  • I tried your solution above, but it doesn't produce any additional error message. I know my underlying function works fine because I can run it using apply with no errors, and the parallelisation works fine in another instance of the underlying function. For one particular instance I just get checkForRemoteErrors(val): 4 nodes produced errors; first error: subscript out of bounds. Any suggestions? – Celeste Mar 15 '16 at 06:47
  • I'm having a similar issue where the above works, but only for 1:2. 1:3 says one node failed, 1:4 says 2, 1:6 says 3, and 1:10 or above says 4 nodes. I think in my case it's down to a limit on an API being accessed. – Oli Oct 27 '16 at 14:48
0

Check the range of your observations and how they vary. I have noticed that when the values have many decimal places (4, 5, or 6), it throws glm.nb off. To solve this, I just round the observations to 2 decimal places.